A complete personal reference covering undergraduate statistics courses — definitions, theory, formulas, visualizations, applications, and critical notes from lectures and textbooks at Begum Rokeya University, Rangpur.
STAT1101 · Principles of Statistics I
STAT1201 · Principles of Statistics II
STAT1102 · Probability Theory
STAT2102 · Probability Distributions
STAT2101 · Regression Analysis & Diagnostics
STAT3203 · Econometrics
STAT2201 · Sampling Distribution
STAT2203 · ANOVA & Design of Experiment
STAT3201 · Hypothesis Testing
STAT4101 · Multivariate Distribution
STAT4201 · Multivariate Analysis II
STAT4102 · Sampling Techniques
STAT4106 · Categorical Data Analysis
STAT4104 · Research Methodology
σ
STAT1101 · STAT1201 · B.Sc. Statistics · BRUR
Principles of Statistics I & II
Statistics & Origin · Central Tendency · Dispersion · Index Numbers · Time Series · Correlation · Regression · Attributes · Shape · Bivariate
Statistics is the science of collecting, organising, analysing, interpreting, and presenting data to make informed decisions and draw conclusions under uncertainty.
💡
Two Branches
Descriptive vs Inferential
Descriptive: Summarises & describes data (means, charts, tables)
Inferential: Draws conclusions about a population from a sample using probability
✅
Where to Use
Applications
Medical research & clinical trials
Economics & finance forecasting
Government census & planning
Machine learning & AI systems
Agriculture & environmental studies
⚠️
Where NOT to Use
Cautions
Predicting individuals with certainty
When data quality is very poor
Proving causation from correlation alone
Non-homogeneous data without caution
Key Quote: "Statistics is the grammar of science." — Karl Pearson. It converts raw numbers into knowledge.
· · ·
02
Background
History & Origins
🏛️
Ancient Roots
Early Beginnings
Babylonians collected census data ~3000 BCE
Egyptians used data for pyramid construction planning
Romans conducted systematic population censuses
India: Arthashastra of Kautilya mentions data collection
📜
Modern Development
17th–20th Century
Graunt (1662): Bills of Mortality — first statistical study of births and deaths
Gauss & Laplace: Normal distribution, method of least squares
Pearson: Correlation coefficient r, chi-square test
Fisher: ANOVA, experimental design, p-values, maximum likelihood
Etymology: From the Latin statisticum collegium ("council of state") and Italian statista ("statesman") — originally data useful to the state. The word entered English as "statistics" in the 18th century.
Nominal: Named categories, no order — gender, blood group, colour
Ordinal: Ordered categories, unequal gaps — grades, satisfaction ratings
Interval: Equal intervals, no true zero — temperature (°C, °F), IQ
Ratio: True zero exists — weight, height, income, time
Levels of Measurement — Hierarchy
· · ·
04
Practical View
Uses, Importance & Limitations
✅
Major Uses
Why We Use Statistics
Simplifying complex masses of data into meaningful summaries
Comparing groups, phenomena, and time periods
Establishing relationships between variables
Forecasting future trends based on past data
Testing hypotheses scientifically with rigour
⭐
Importance
Why It Matters
Basis for evidence-based policy and decision-making
Essential in every science, social study, and industry
Enables uncertainty quantification and risk assessment
Guides business, economic, and medical decisions
⚠️
Limitations
What Statistics Cannot Do
Deals only with quantifiable, aggregated facts
Results can be misused or deliberately manipulated
Statistical laws apply to groups, not individuals
Requires homogeneous, high-quality data
Cannot prove causation on its own
· · ·
05
Data Collection
Sources of Statistical Data
🔵
Primary Sources
Original Data (First-hand)
Direct personal observation
Questionnaires & structured surveys
Interviews (direct/indirect methods)
Experimental data from controlled studies
Registration systems (births, deaths, marriages)
📂
Secondary Sources
Existing/Published Data
Government publications & national census
Research journals, reports & theses
International agencies (UN, WHO, World Bank, IMF)
Newspapers, almanacs, online databases
💡
Which to Choose?
Primary vs Secondary
Use primary when precision & specificity are critical and budget allows. Use secondary when time/cost are constraints. Always check secondary data for reliability, suitability, and adequacy before use.
· · ·
06
Data Pipeline
Processing & Preprocessing
⚙️
Steps in the Process
Data Processing Pipeline
Editing: Check for errors, omissions, inconsistencies
Coding: Assign numerical values to categorical responses
Classification: Group data into meaningful classes
Tabulation: Arrange data in tables (frequency distributions)
Presentation: Charts, graphs, diagrams for communication
📊
Frequency Distributions
Organising Raw Data
Class interval, class limits, class mark (midpoint)
Class frequency & relative frequency (proportion)
Cumulative frequency (less than / greater than)
Histogram, Frequency Polygon, Ogive (cumulative curve)
Golden Rule of Preprocessing: "Garbage in, garbage out." Cleaning the data is the most critical step. Missing values, outliers, and coding errors must be detected and handled before any statistical analysis.
Histogram — Frequency Distribution Concept
· · ·
07
Descriptive Statistics
Measures of Central Tendency
📖
What it is
The "Centre" of Data
A single value representing the typical or central value in a dataset. The three primary measures are Mean, Median, and Mode, each optimal under different data conditions.
🔢
Key Formulas
The Big Five
AM: Σx / n — arithmetic average
Median: Middle value in sorted data
Mode: Most frequently occurring value
GM: (x₁·x₂·…·xₙ)^(1/n) — for ratios, growth
HM: n / Σ(1/xᵢ) — for rates & speeds
✅
When to Use Each
Right Tool, Right Job
Mean: Symmetric data, no extreme outliers, interval/ratio scale
Median: Skewed distributions, income, housing prices, ordinal data
Mode: Categorical data, most popular item, bimodal distributions
GM: Ratios, growth rates, compound interest, index numbers
HM: Averaging rates, speeds, prices per unit
⚠️
Cautions
Common Mistakes
Mean is highly sensitive to outliers — check for skewness first
Mode may not exist or may not be unique (bimodal)
Never compute the mean for nominal or ordinal data
AM ≥ GM ≥ HM always (equality only when all values equal)
Inequality: HM ≤ GM ≤ AM (always; equality iff all xᵢ equal)
Median (odd n): M = x₍(n+1)/2₎ after sorting
Median (even n): M = [x₍n/2₎ + x₍n/2+1₎] / 2
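A minimal Python sketch of these formulas (the data values are illustrative), which also confirms the AM ≥ GM ≥ HM inequality:

```python
import math

data = [2.0, 4.0, 8.0]  # illustrative values
n = len(data)

am = sum(data) / n                    # arithmetic mean: Σx / n
gm = math.prod(data) ** (1 / n)       # geometric mean: (x₁·x₂·…·xₙ)^(1/n)
hm = n / sum(1 / x for x in data)     # harmonic mean: n / Σ(1/xᵢ)

xs = sorted(data)
# median: middle value (odd n) or mean of the two middle values (even n)
median = xs[n // 2] if n % 2 else (xs[n // 2 - 1] + xs[n // 2]) / 2
```

For [2, 4, 8] the geometric mean is exactly 4, and the inequality holds strictly because the values differ.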
Central Tendency — Symmetric vs Skewed Distributions
· · ·
08
Spread of Data
Measures of Dispersion
📏
What it is
Quantifying Variability
Dispersion measures the spread or variability in a dataset. Two datasets can have the same mean but vastly different spreads — dispersion captures this critical difference.
⚙️
All Measures
Absolute & Relative
Range: Max − Min (simplest; very sensitive to outliers)
Quartile Deviation (QD): (Q3−Q1)/2
Mean Deviation (MD): Σ|x−x̄| / n
Variance (σ²): Σ(x−x̄)² / n or s² = Σ(x−x̄)² / (n−1)
Std Deviation (σ): √Variance
Coeff. of Variation (CV): (σ/x̄)×100 — unit-free comparator
💡
Main Idea
Absolute vs Relative
Absolute: Range, SD, Variance — in original units; cannot compare datasets with different units
Relative: CV — unit-free percentage; use to compare variability across different datasets
Population Variance: σ² = (1/N) · Σᵢ(xᵢ − μ)²
Sample Variance: s² = (1/(n−1)) · Σᵢ(xᵢ − x̄)²
Std Deviation: σ = √[ Σᵢ(xᵢ − μ)² / N ]
Computing formula: σ² = (1/n)Σxᵢ² − x̄²
Coeff. of Variation: CV = (σ / x̄) × 100%
Quartile Deviation: QD = (Q3 − Q1) / 2
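These dispersion formulas can be sketched in Python (sample values are illustrative):

```python
import math

def dispersion(data, sample=True):
    """Return (variance, sd, cv%); n−1 divisor for sample, n for population."""
    n = len(data)
    mean = sum(data) / n
    ss = sum((x - mean) ** 2 for x in data)      # Σ(x − mean)²
    var = ss / (n - 1) if sample else ss / n
    sd = math.sqrt(var)
    cv = sd / mean * 100                         # unit-free comparator
    return var, sd, cv

var, sd, cv = dispersion([2, 4, 4, 4, 5, 5, 7, 9], sample=False)
# mean = 5, so σ² = 4, σ = 2, CV = 40%
```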
· · ·
09
Economic Measurement
Index Numbers
📈
What it is
Relative Change Measure
An index number measures the relative change in a variable (or group) compared to a base period. Expressed as a percentage relative to the base (base period = 100). Used to track changes over time.
⚙️
Types
Key Methods
Laspeyres Index: Uses base-period quantities as weights
Paasche Index: Uses current-period quantities as weights
Fisher's Ideal Index: Geometric mean of Laspeyres & Paasche — satisfies time reversal & factor reversal tests
Value index: Ratio of current to base-period value
✅
Real-World Use
Applications
Consumer Price Index (CPI) — measuring inflation
Stock market indices (S&P 500, BSE Sensex)
Human Development Index (HDI)
Adjusting wages for purchasing power
Laspeyres P-Index: L = (Σ p₁q₀) / (Σ p₀q₀) × 100
Paasche P-Index: P = (Σ p₁q₁) / (Σ p₀q₁) × 100
Fisher Ideal Index: F = √(L × P)
Simple Price Rel.: P₀₁ = (p₁ / p₀) × 100
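A sketch of the three index formulas, using a hypothetical two-commodity basket with a uniform 20% price rise (so all three indices agree at 120):

```python
def laspeyres(p0, q0, p1):
    """Laspeyres price index: base-period quantities q0 as weights."""
    return 100 * sum(p * q for p, q in zip(p1, q0)) / sum(p * q for p, q in zip(p0, q0))

def paasche(p0, p1, q1):
    """Paasche price index: current-period quantities q1 as weights."""
    return 100 * sum(p * q for p, q in zip(p1, q1)) / sum(p * q for p, q in zip(p0, q1))

def fisher(l_index, p_index):
    """Fisher's ideal index: geometric mean of Laspeyres and Paasche."""
    return (l_index * p_index) ** 0.5

p0, p1 = [10, 20], [12, 24]   # every price up exactly 20%
q0, q1 = [5, 3], [4, 3]
L = laspeyres(p0, q0, p1)     # 120.0
P = paasche(p0, p1, q1)       # 120.0
F = fisher(L, P)              # 120.0
```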
· · ·
10
Temporal Data
Time Series Basics
🕐
What it is
Data Over Time
A time series is a sequence of data points collected at successive, equally-spaced time intervals. Goal: identify patterns, decompose components, and forecast future values.
💡
4 Components
Decomposition (TSCI)
Trend (T): Long-term direction (upward/downward/stationary)
Seasonal (S): Regular periodic fluctuations within a year
Cyclical (C): Long-run waves lasting 2–10 years (business cycles)
Irregular (I): Random, unpredictable residual fluctuations
Smoothing & trend-fitting methods:
Moving averages: Simple smoothing of irregular fluctuations
Least squares: Fit linear/polynomial trend equation
Exponential smoothing: Weighted average of past observations
Additive Model: Y = T + S + C + I
Multiplicative Model: Y = T × S × C × I
Trend Line (OLS): Ŷ = a + bt (t = coded time)
3-Period Moving Avg: MA₃ = (Yₜ₋₁ + Yₜ + Yₜ₊₁) / 3
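A sketch of the moving average and OLS trend line in Python (the series values are illustrative):

```python
def moving_avg3(y):
    """Centred 3-period moving average: MA₃ = (Yₜ₋₁ + Yₜ + Yₜ₊₁) / 3."""
    return [(y[t - 1] + y[t] + y[t + 1]) / 3 for t in range(1, len(y) - 1)]

def ols_trend(y):
    """Fit trend line Ŷ = a + bt by least squares, coded time t = 0, 1, …, n−1."""
    n = len(y)
    tbar = (n - 1) / 2
    ybar = sum(y) / n
    b = (sum((t - tbar) * (yt - ybar) for t, yt in enumerate(y))
         / sum((t - tbar) ** 2 for t in range(n)))
    a = ybar - b * tbar
    return a, b

smoothed = moving_avg3([1, 2, 9, 4, 5])   # [4.0, 5.0, 6.0] — spike at 9 smoothed out
a, b = ols_trend([2, 4, 6, 8, 10])        # perfect linear trend: a = 2, b = 2
```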
· · ·
11
Bivariate Analysis
Correlation
🔗
What it is
Measuring Association
Correlation measures the strength and direction of the linear relationship between two variables. The Pearson correlation coefficient r ranges from −1 to +1.
💡
Types
Types of Correlation
Positive (r > 0): Both variables increase together
Negative (r < 0): One increases, other decreases
Zero (r = 0): No linear relationship
Perfect (r = ±1): All points on a straight line
⚙️
Methods
How to Compute
Pearson's r: For interval/ratio data with linear relation
Spearman's ρ: For ordinal/ranked data or monotonic non-linear relations
Scatter diagram: Always plot first — visualise the relationship
⚠️
Critical Warning
Correlation ≠ Causation
High correlation does not prove one variable causes the other. A lurking (confounding) variable may drive both. Always investigate mechanism and theory before claiming causation.
r² (Coeff. of Det.): r² = Explained variation / Total variation
Correlation Strength — Scatter Plot Patterns
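Pearson's r can be sketched directly from its definition (example data chosen to show the perfect-correlation cases):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient r ∈ [−1, +1]."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))   # Σ(x−x̄)(y−ȳ)
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])   # all points on an upward line: +1
r_neg = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])   # all points on a downward line: −1
```

Squaring r gives the coefficient of determination r², the share of variation explained.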
· · ·
12
Prediction
Regression Analysis
📉
What it is
Line of Best Fit
Regression establishes a mathematical relationship to predict the value of a dependent variable (Y) from an independent variable (X) using the principle of Ordinary Least Squares (OLS).
⚙️
OLS Principle
Minimising Residuals
OLS minimises the sum of squared residuals (SSE) — the vertical distances between observed Y and predicted Ŷ. This gives the unique best-fit line through the data. Two regression lines exist: Y on X, and X on Y; they intersect at (x̄, ȳ).
Attributes are qualitative characteristics (literacy, colour, gender, disease) that are categorised rather than measured. Analysis counts classes and tests association between categories.
⚙️
Methods
Statistical Tools
Contingency tables: Cross-tabulation of two attributes
χ² test: Tests independence between attributes
Yule's Q: Coefficient of association (−1 to +1)
Consistency check: Ensure all class frequencies ≥ 0
💡
Association
When Are Attributes Related?
Two attributes are associated if their joint frequency differs from expectation under independence. Positive association: both present together more than chance. Negative: inversely linked.
Chi-square Test: χ² = Σ (O − E)² / E (O = observed, E = expected)
Expected Frequency: E = (Row total × Column total) / Grand total
Yule's Q: Q = (AD − BC) / (AD + BC)
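A sketch of χ² and Yule's Q for a 2×2 contingency table (the cell counts are illustrative, chosen proportional so the attributes are exactly independent):

```python
def chi_square_2x2(a, b, c, d):
    """χ² = Σ (O − E)² / E for the 2×2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row = [a + b, c + d]
    col = [a + c, b + d]
    obs = [[a, b], [c, d]]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            e = row[i] * col[j] / n          # E = row total × column total / grand total
            chi2 += (obs[i][j] - e) ** 2 / e
    return chi2

def yules_q(a, b, c, d):
    """Yule's coefficient of association: Q = (AD − BC) / (AD + BC)."""
    return (a * d - b * c) / (a * d + b * c)

chi2_indep = chi_square_2x2(10, 20, 20, 40)   # proportional cells ⇒ χ² = 0
q_indep = yules_q(10, 20, 20, 40)             # AD = BC ⇒ Q = 0, no association
```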
· · ·
14
Distribution Shape
Shape Characteristics — Skewness & Kurtosis
〰️
Skewness
Asymmetry of Distribution
Symmetric (Sk=0): Mean = Median = Mode
Positive skew (+): Mean > Median > Mode — right tail longer
Negative skew (−): Mean < Median < Mode — left tail longer
📐
Kurtosis
Peakedness (Tailedness)
Mesokurtic (β₂=3): Normal distribution — standard shape
Leptokurtic (β₂>3): More peaked, heavier tails than normal
Platykurtic (β₂<3): Flatter peak, lighter tails than normal
Kurtosis β₂: β₂ = μ₄ / σ⁴ (4th central moment / (σ²)²)
Excess Kurtosis: γ₂ = β₂ − 3 (= 0 for normal)
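The moment measures of shape can be sketched as (data values are illustrative):

```python
def shape(data):
    """Moment-based skewness √β₁ = μ₃/σ³ and kurtosis β₂ = μ₄/σ⁴."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # σ² (2nd central moment)
    m3 = sum((x - mean) ** 3 for x in data) / n   # 3rd central moment
    m4 = sum((x - mean) ** 4 for x in data) / n   # 4th central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2

skew, kurt = shape([1, 2, 3, 4, 5])
# symmetric data ⇒ skew = 0; this flat-topped set is platykurtic (β₂ = 1.7 < 3)
```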
· · ·
15
Two-Variable Analysis
Bivariate Distribution
📊
What it is
Joint Distribution of (X, Y)
A bivariate distribution shows the joint frequency distribution of two variables simultaneously — revealing their individual behaviour AND their joint patterns and dependence structure.
⚙️
Key Concepts
Components
Marginal distributions: Distribution of each variable alone (sum over other)
Conditional distributions: One variable given the other fixed
Bivariate normal: 2D bell curve — described by μₓ, μᵧ, σₓ, σᵧ, and ρ
💡
Why It Matters
Bridge to Multivariate
Bivariate analysis is the essential bridge between single-variable and multivariate statistics. Correlation and regression both rest on understanding the bivariate joint distribution of (X, Y).
P(A|B) = the probability of A given that B has already occurred. We restrict the sample space to B and measure A within it. This is the "updated" probability with new information.
💡
Independence
When Knowledge Changes Nothing
A and B independent iff P(A|B) = P(A)
Equivalently: P(A ∩ B) = P(A) · P(B)
Independence ≠ mutual exclusivity
Mutually exclusive events with P>0 are never independent
🔄
Bayes' Theorem
Reversing Conditional Probability
Given P(E|H) we find P(H|E). We update prior belief P(H) with evidence E to get posterior P(H|E). Used in: medical diagnosis, spam filtering, ML classifiers.
✅
Law of Total Probability
Averaging Over Causes
If {H₁,…,Hₙ} is a partition of S, then: P(E) = Σᵢ P(E|Hᵢ)·P(Hᵢ). The denominator of Bayes' theorem — the total probability of the evidence.
Bayes Intuition: A medical test is positive. Bayes tells you the true probability of actually having the disease, accounting for the test's false positive rate AND the disease prevalence (prior). Without Bayes, most people vastly overestimate their risk.
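The medical-test intuition above can be made concrete. A sketch with hypothetical numbers (99% sensitivity, 95% specificity, 1% prevalence — all illustrative assumptions):

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive) via Bayes' theorem + law of total probability."""
    # P(+) = P(+|D)·P(D) + P(+|not D)·P(not D)
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_positive

p = posterior(prior=0.01, sensitivity=0.99, specificity=0.95)
# ≈ 0.167 — despite a "99% accurate" test, the posterior is only about 1 in 6,
# because false positives from the healthy 99% swamp the rare true positives
```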
· · ·
P5
Core Theory
Random Variables & Mathematical Expectation
🎯
Random Variable
Mapping Outcomes to Numbers
X: S → ℝ assigns a real number to each sample point. Capital X = the RV (function); lowercase x = the value it takes. Converts non-numeric experiments into numbers for analysis.
💡
Discrete vs Continuous
Two Types of RVs
Discrete: Countable values {0,1,2,…} — described by PMF p(x)
Continuous: Any value in an interval — described by PDF f(x)
CDF F(x) = P(X ≤ x) exists for both types
⚙️
Expectation & Moments
Summary Measures
E(X): Probability-weighted average — the "centre of gravity"
Var(X) = E(X²) − [E(X)]²
rth raw moment: μ'ᵣ = E(Xʳ)
rth central moment: μᵣ = E[(X−μ)ʳ]
Linearity: E(aX+b) = aE(X)+b
🔢
Covariance & Correlation
Between Two RVs
Cov(X,Y) = E(XY) − E(X)·E(Y)
ρ(X,Y) = Cov(X,Y) / (σ_X·σ_Y)
Independent → Cov = 0 (not always vice versa)
E(X) — discrete: Σₓ x · p(x), where Σ p(x) = 1
E(X) — continuous: ∫₋∞^∞ x · f(x) dx, where ∫ f(x) dx = 1
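The discrete expectation and variance formulas can be sketched with a PMF stored as a dict (the fair-die example is illustrative):

```python
def expectation(pmf):
    """E(X) = Σ x·p(x) — the probability-weighted average."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E(X²) − [E(X)]²."""
    mu = expectation(pmf)
    return sum(x * x * p for x, p in pmf.items()) - mu ** 2

die = {face: 1 / 6 for face in range(1, 7)}   # fair six-sided die
mu = expectation(die)     # 3.5
var = variance(die)       # 35/12 ≈ 2.9167
```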
🪙
Bernoulli(p)
Single Trial, Two Outcomes
One trial, two outcomes: 1 (success) with prob p, 0 (failure) with prob (1−p)
E(X) = p; Var(X) = p(1−p)
Building block for Binomial
🎰
Binomial(n, p)
n Independent Bernoulli Trials
Counts number of successes in n independent trials
P(X=k) = C(n,k)·pᵏ·(1−p)ⁿ⁻ᵏ
E(X) = np; Var(X) = np(1−p)
Use when: fixed n, each trial independent, constant p
☎️
Poisson(λ)
Rare Events in Time/Space
Counts events in a fixed interval (time, area, volume)
P(X=k) = e⁻λ·λᵏ / k!
E(X) = Var(X) = λ — unique equal mean & variance!
Use for: calls/hour, defects/unit, accidents/year
🔢
Geometric(p)
Waiting for First Success
P(X=k) = (1−p)^(k−1)·p where k=1,2,3,…
E(X) = 1/p; Var(X) = (1−p)/p²
Memoryless: P(X>s+t|X>s) = P(X>t)
Use for: number of trials to first success
Binomial PMF: P(X=k) = C(n,k) · pᵏ · (1−p)ⁿ⁻ᵏ
Binomial Mean/Var: E(X) = np ; Var(X) = np(1−p)
Poisson PMF: P(X=k) = e⁻λ · λᵏ / k! (k = 0,1,2,…)
Poisson Mean/Var: E(X) = Var(X) = λ
Geometric PMF: P(X=k) = (1−p)^(k−1) · p
Hypergeometric: P(X=k) = C(K,k)·C(N−K,n−k) / C(N,n)
Binomial(10, 0.3) vs Poisson(3) — PMF Comparison
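The Binomial(10, 0.3) vs Poisson(3) comparison can be reproduced directly from the PMF formulas — both distributions have mean 3:

```python
import math

def binom_pmf(k, n, p):
    """P(X=k) = C(n,k)·pᵏ·(1−p)ⁿ⁻ᵏ."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X=k) = e⁻λ·λᵏ / k!."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# E(X) = np = 10 × 0.3 = 3 for the binomial, matching Poisson λ = 3
binom_mean = sum(k * binom_pmf(k, 10, 0.3) for k in range(11))
pois_mass = sum(poisson_pmf(k, 3) for k in range(50))   # PMF sums to 1
```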
· · ·
D2
Continuous Distributions
Normal, Exponential, Uniform, Gamma & Beta
🔔
Normal N(μ, σ²)
The Bell Curve — Most Important
Symmetric about mean μ; inflection points at μ±σ
68-95-99.7 rule for 1σ, 2σ, 3σ from mean
Standard Normal Z ~ N(0,1): Z = (X−μ)/σ
Central Limit Theorem: sample means → Normal
⏱️
Exponential(λ)
Time Until First Event
f(x) = λe⁻λˣ for x ≥ 0
E(X) = 1/λ; Var(X) = 1/λ²
Memoryless: P(X>s+t|X>s) = P(X>t)
Continuous analog of geometric distribution
📐
Uniform U(a, b)
Equal Probability Everywhere
f(x) = 1/(b−a) for a ≤ x ≤ b
E(X) = (a+b)/2; Var(X) = (b−a)²/12
All values equally likely in [a, b]
🌀
Gamma & Beta
Flexible Family Distributions
Gamma(α,β): Generalises exponential; waiting time for αth event. E(X)=αβ
Beta(α,β): Defined on [0,1]; used for proportions, probabilities. Very flexible shape.
Normal PDF: f(x) = [1/(σ√(2π))] · exp[−(x−μ)²/(2σ²)]
Standard Normal Z: Z = (X − μ) / σ ~ N(0,1)
Exponential PDF: f(x) = λ·e⁻λˣ , x ≥ 0 ; E(X) = 1/λ
Uniform PDF: f(x) = 1/(b−a) for x ∈ [a,b]
Gamma PDF: f(x) = xᵅ⁻¹·e^(−x/β) / [βᵅ·Γ(α)] , x > 0
Beta PDF: f(x) = xᵅ⁻¹(1−x)ᵝ⁻¹ / B(α,β) , x ∈ [0,1]
Normal Distribution — The 68-95-99.7 Empirical Rule
Which Distribution to Use? Binary single trial → Bernoulli. Counting successes in n independent trials with constant p → Binomial. Rare events in time/space → Poisson. Waiting for first success/event → Geometric/Exponential. Sampling without replacement → Hypergeometric. Heights, errors, averages → Normal. Waiting for the αth event → Gamma. Proportions → Beta.
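The 68-95-99.7 empirical rule can be verified from the standard normal CDF, which Python's `math.erf` gives in closed form:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X ≤ x) for X ~ N(μ, σ²), via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# probability mass within 1σ, 2σ, 3σ of the mean
within = [normal_cdf(k) - normal_cdf(-k) for k in (1, 2, 3)]
# ≈ [0.6827, 0.9545, 0.9973] — the 68-95-99.7 rule
```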
Y = β₀ + β₁X + ε. We model the linear relationship between a response Y (dependent) and a predictor X (independent), where ε is random error. We estimate β₀ & β₁ from sample data.
⚙️
Model Assumptions
LINE Assumptions
L — Linearity: True relationship is linear in X
I — Independence: Errors εᵢ are independent
N — Normality: Errors ~ N(0, σ²)
E — Equal variance: Var(εᵢ) = σ² (homoscedasticity)
💡
Interpretation
Meaning of Coefficients
β₀ (intercept): Expected value of Y when X = 0
β₁ (slope): Change in E(Y) for each 1-unit increase in X
Sign of β₁ tells direction; magnitude tells strength
✅
Where to Use
Regression Applications
Predicting outcomes (sales, yield, price) from predictors
Quantifying effect size of a predictor on outcome
Controlling for confounders in observational studies
Simple Linear Regression — Fitted Line & Residuals
· · ·
R2
Estimation
OLS Estimation & BLUE Properties
⚙️
OLS Principle
Minimise Sum of Squared Errors
We choose b₀ and b₁ to minimise SSE = Σ(Yᵢ − b₀ − b₁Xᵢ)². Taking partial derivatives and setting to zero gives the normal equations, leading to closed-form solutions.
💡
Gauss-Markov Theorem
BLUE Estimators
Under the LINE assumptions, OLS estimators are Best Linear Unbiased Estimators (BLUE). They have the smallest variance among all linear unbiased estimators. This is the most important theorem in regression.
📊
Variance Decomposition
SST = SSR + SSE
SST: Total sum of squares = Σ(Yᵢ−ȳ)²
SSR: Regression SS = Σ(Ŷᵢ−ȳ)² (explained by model)
SSE: Error SS = Σ(Yᵢ−Ŷᵢ)² (unexplained residual)
PI for new Y: Ŷ ± t · s·√[1 + 1/n + (x*−x̄)²/Sxx] — wider!
💡
CI vs Prediction Interval
Key Distinction
CI for mean E(Y|x*) is narrower — for the average at x*. Prediction interval (PI) is wider — for an individual future observation. PI includes extra uncertainty from ε. Both narrow near x̄, widen as x* moves away.
🔢
F-Test
Overall Model Significance
H₀: β₁ = … = βₖ = 0 (no predictors help)
F = MSR / MSE ~ F(k, n−k−1) under H₀
Equivalent to t-test in simple regression (F = t²)
t-statistic for β₁: t = b₁ / [s / √Sxx] ~ t(n−2)
SE(b₁): SE(b₁) = s / √Sxx where s = √MSE
CI for β₁: b₁ ± t_(α/2, n−2) · SE(b₁)
F for overall model: F = MSR / MSE = (SSR/k) / (SSE/(n−k−1))
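A sketch of OLS estimation plus the t-statistic for β₁ in the simple case (the four (x, y) points are illustrative):

```python
import math

def simple_ols(x, y):
    """OLS fit Ŷ = b0 + b1·x, with SE(b1) and the t-statistic for b1."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    b1 = sxy / sxx                        # slope
    b0 = ybar - b1 * xbar                 # intercept: line passes through (x̄, ȳ)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))          # s = √MSE
    se_b1 = s / math.sqrt(sxx)
    return b0, b1, se_b1, (b1 / se_b1 if se_b1 > 0 else float("inf"))

b0, b1, se_b1, t = simple_ols([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
# b1 = 1.94, b0 = 0.15; large t ⇒ slope clearly significant
```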
· · ·
R4
Variance Partitioning
ANOVA Table for Regression
ANOVA Table Structure — Simple Linear Regression
· · ·
R5
Model Checking
Residual Analysis & Diagnostics
🔬
Why Diagnostics?
Checking Model Assumptions
Residuals eᵢ = Yᵢ − Ŷᵢ carry information about assumption violations. Always plot residuals before trusting inference. A good model has residuals that look like random noise.
📊
Key Diagnostic Plots
4 Essential Plots
Residuals vs Fitted (Ŷᵢ): Check linearity & homoscedasticity. Should be random scatter around zero.
Normal Q-Q plot: Check normality of residuals. Points should lie on a straight diagonal line.
Scale-Location plot: √|eᵢ| vs Ŷᵢ — check homoscedasticity.
Residuals vs Leverage: Identify influential points & Cook's D.
💡
Standardised Residuals
Types of Residuals
Ordinary: eᵢ = Yᵢ − Ŷᵢ (raw residuals)
Standardised: rᵢ = eᵢ / (s√(1−hᵢᵢ)) — scale-free; should be within ±2
Studentised deleted: rᵢ* — uses s₍ᵢ₎ without point i — best for outlier detection
VIF (Var. Inflation): VIF_j = 1/(1 − Rj²), where Rj² = R² from regressing Xj on all other predictors
Breusch-Pagan Test: Regress eᵢ² on Xᵢ; test F or nR² ~ χ²(k)
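With only two predictors, Rj² reduces to the squared Pearson correlation between them, so the VIF can be sketched without a full auxiliary regression (the data values are illustrative):

```python
def vif_two(x1, x2):
    """VIF for one of exactly two predictors: VIF = 1/(1 − r²)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    sxy = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    sxx = sum((a - m1) ** 2 for a in x1)
    syy = sum((b - m2) ** 2 for b in x2)
    r2 = sxy * sxy / (sxx * syy)          # squared Pearson r = Rj² here
    return 1 / (1 - r2)

vif = vif_two([1, 2, 3, 4], [1, 3, 2, 4])   # r = 0.8 ⇒ VIF = 1/0.36 ≈ 2.78
```

As r → 1 the denominator shrinks and the VIF blows up — the variance-inflation effect of multicollinearity.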
· · ·
R7
Outlier Detection
Influential Points, Outliers & Leverage
🎯
Outliers in Y
Large Residuals
A point with a large studentised residual |rᵢ| > 2 or 3. Outliers in Y can inflate MSE and distort regression estimates. Check if real or data error.
🔭
High Leverage Points
Outliers in X Space
Leverage hᵢᵢ (hat matrix diagonal) measures how far Xᵢ is from x̄. Rule of thumb: hᵢᵢ > 2(k+1)/n signals high leverage. High leverage = potential for high influence.
💡
Cook's Distance D
Overall Influence
Cook's D measures the effect of deleting point i on ALL fitted values. D > 1 (or D > 4/n) suggests the point is influential. Combines residual size and leverage: a high-leverage point with large residual is most influential.
🔢
DFFITS & DFBETAS
Change-in-Fit Statistics
DFFITS: Change in Ŷᵢ when point i is deleted (standardised)
DFBETAS: Standardised change in each coefficient bⱼ when point i is deleted
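In the simple-regression case the hat-matrix diagonals have a closed form, so leverage can be sketched directly (the x values are illustrative, with one point far from x̄):

```python
def leverage(x):
    """Hat diagonals for simple regression: hᵢᵢ = 1/n + (xᵢ − x̄)²/Sxx."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    return [1 / n + (a - xbar) ** 2 / sxx for a in x]

h = leverage([1, 2, 3, 10])
# x = 10 sits far from x̄ = 4, so it carries the highest leverage;
# the hᵢᵢ always sum to the number of parameters (here 2: intercept + slope)
```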
Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε. Each βⱼ is the partial effect of Xⱼ on Y, holding all other predictors constant. Estimated by matrix algebra: b = (X'X)⁻¹X'Y.
💡
Adjusted R²
Penalised Fit Measure
R² always increases when adding predictors (even irrelevant ones). Adjusted R² penalises for the number of predictors — use this to compare models with different numbers of predictors.
⚙️
Model Selection
Choosing Predictors
Forward selection: Add predictors one at a time
Backward elimination: Remove least significant predictors
Stepwise: Combine both directions
AIC/BIC: Information criteria — lower is better
Cross-validation: Out-of-sample prediction error
🎯
Logistic Regression
Binary Response Variable
When Y ∈ {0,1}, linear regression is inappropriate. Use logistic regression: log[p/(1−p)] = β₀ + β₁X₁ + …. Coefficients interpreted as log-odds; exp(βⱼ) = odds ratio. Estimated by MLE, not OLS.
Odds Ratio: OR_j = exp(βⱼ) — effect of a 1-unit increase in Xⱼ on the odds of Y=1
OLS vs Logistic: Use OLS regression when Y is continuous (approximately). Use logistic regression when Y is binary (0/1). Never fit a linear regression to a binary outcome — it can predict probabilities outside [0,1] and violates the normality/homoscedasticity assumptions.
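The odds-ratio interpretation can be checked numerically: under the logistic model, moving x up by one unit multiplies the odds by exactly exp(β₁). The coefficients below are hypothetical:

```python
import math

def logistic_prob(b0, b1, x):
    """P(Y=1 | x) = 1 / (1 + e^−(β₀ + β₁x)) — the inverse logit."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def odds(p):
    """Odds = p / (1 − p)."""
    return p / (1 - p)

b0, b1 = -1.0, 0.5                                  # hypothetical coefficients
or_hat = odds(logistic_prob(b0, b1, 3)) / odds(logistic_prob(b0, b1, 2))
# ratio of odds at x = 3 vs x = 2 equals e^β₁ regardless of where x starts
```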
STAT3203 · Econometrics
Y
STAT3203 · B.Sc. Statistics Year 3 · BRUR
Econometrics
Classical Linear Model · OLS · Multicollinearity · Heteroscedasticity · Autocorrelation · Specification Errors · Dummy Variables · Simultaneous Equations · Time Series
🎓 What is Econometrics?
Econometrics is what happens when statistics and economics go on a date and have a baby called "regression." It asks: "Yes, we think X causes Y in theory — but how strong is that relationship in actual data, and can we prove it?" As Gujarati puts it: "Econometrics is the art and science of using statistical methods to test economic theories and forecast economic phenomena." The joke among economists: "Economists use models to explain what has already happened, and models to predict the future — and the same model is usually wrong in both cases." 😄
Econometrics = Economics + Metrics. It applies statistical and mathematical methods to quantify economic relationships, test economic theories, and forecast future economic activity. Gujarati defines it as the "quantitative analysis of actual economic phenomena."
⚙️
The Three Steps
Econometric Methodology
1. Economic model: Theory says Y depends on X₁, X₂,… (e.g., consumption depends on income)
2. Econometric model: Add error term — Y = f(X₁,X₂) + ε
3. Estimate & test: Use data to estimate parameters and test hypotheses
💡
Real World Example
Keynesian Consumption Function
Theory: Consumption increases with income. Econometric model: C = β₀ + β₁Y + ε, where β₁ is the Marginal Propensity to Consume (MPC) — how much of each extra taka is consumed. We estimate it from real survey data!
😂 Econometrician's Joke: "An economist, a physicist, and an econometrician are stranded on an island with canned food. The physicist says 'let's use a rock to open the cans.' The economist says 'assume we have a can opener.' The econometrician says 'let's regress can-opening on island conditions, correct for heteroscedasticity, and check the instrumental variables.'" — Econometrics solves real problems, just very thoroughly! 😄
· · ·
E2
Foundation
Classical Linear Regression Model (CLRM)
🎯 The CLRM is the Backbone: Every econometrics problem starts by asking: "Which CLRM assumption is violated here?" Like a doctor checking vital signs before treating a patient — you must check the assumptions before trusting the results.
📋
The 10 Assumptions
CLRM Assumptions (Gujarati)
A1: Linear in parameters — model is linear in β (not necessarily in X)
A2: Fixed X values — X is non-stochastic (or fixed in repeated sampling)
A3: Zero mean error — E(εᵢ) = 0
A4: Homoscedasticity — Var(εᵢ) = σ² (constant)
A5: No autocorrelation — Cov(εᵢ, εⱼ) = 0, i≠j
A6: Zero covariance between error and X — Cov(εᵢ, Xᵢ) = 0
A7: n > k — more observations than parameters
A8: Variability in X — Var(X) ≠ 0
A9: No perfect multicollinearity — no exact linear relation among Xs
A10: Normality of ε — εᵢ ~ N(0, σ²)
💡
LINE Simplified
Remember: LINE
Linearity — relationship is linear in parameters
Independence — errors are independent of each other
Normality — errors are normally distributed
Equal variance — errors have constant variance (homoscedastic)
😄 Memory tip: "LINE up your assumptions or your results will be crooked!"
⚠️
What Happens When Violated
Consequences Table
A4 violated (hetero): OLS unbiased but inefficient; wrong SEs
A5 violated (autocorr): OLS unbiased but inefficient; wrong SEs
A9 violated (multicoll): OLS unbiased but very large variance; unreliable estimates
Omitted variable: OLS biased AND inconsistent — the worst!
🌍
Real Scenario
Estimating Wage Equation
Model: Wage = β₀ + β₁Education + β₂Experience + ε. Check: does the error have constant variance? (Workers with more education may have more variable wages → heteroscedasticity.) Are education and experience correlated? (Older workers often have more experience AND education → multicollinearity.) Always diagnose first!
OLS chooses β̂ to minimise SSE = Σeᵢ² = Σ(Yᵢ − Ŷᵢ)². The "squaring" penalises large errors more — like a strict teacher who really hates big mistakes more than small ones! 😄 The solution is unique and closed-form.
🏆
Gauss-Markov Theorem
BLUE — Why OLS is Best
Under assumptions A1–A9 (normality not required), OLS estimators are: Best — minimum variance; Linear — in Y; Unbiased — E(β̂) = β; Estimators. No other linear unbiased estimator has smaller variance! Think of it as OLS being the "most efficient honest statistician."
⚙️
OLS Properties
Algebraic Properties
Σeᵢ = 0 (residuals sum to zero)
Σeᵢ·Xᵢ = 0 (residuals uncorrelated with X)
Regression line passes through (X̄, Ȳ)
Σeᵢ·Ŷᵢ = 0 (residuals uncorrelated with fitted values)
📐
Goodness of Fit
R² and its Limits
R² ∈ [0,1]; R²=1 perfect fit; R²=0 model explains nothing
Warning: High R² ≠ good model! You can have high R² with spurious regression (two random trends)
Adjusted R²: Penalises for extra predictors — use for model comparison
😄 "A high R² in time series is suspicious, not impressive!"
🌍 Real World: Bangladesh rice yield data: Yield = 1200 + 45·Fertiliser + 30·Rain + ε. R² = 0.82 means 82% of the variation in yield is explained by fertiliser and rainfall. β̂₁ = 45 means: holding rain constant, each extra kg of fertiliser per acre increases yield by 45 kg. This directly guides agricultural policy!
· · ·
E4
Problem 1
Multicollinearity — The Identity Crisis
😂 The Multicollinearity Joke: "Multicollinearity is like trying to tell apart identical twins by asking their friends — everyone says 'they're basically the same.' Your model literally cannot figure out who is doing what." When X₁ and X₂ are nearly perfectly correlated, the model gets confused about whose "fault" it is when Y changes.
🔍
What it is
Correlated Predictors
Multicollinearity occurs when two or more predictor variables are highly correlated with each other. Perfect multicollinearity = exact linear relationship (OLS breaks down entirely). Near-perfect = high but not perfect correlation (OLS works but gives unreliable estimates).
⚙️
Detection
How to Detect
Correlation matrix: |rᵢⱼ| > 0.8 between predictors — warning sign
VIF: VIF_j = 1/(1 − Rj²); VIF > 10 signals serious multicollinearity
🌍 Bangladesh Example: Regressing household expenditure on income and wealth. Income and wealth are highly correlated (r=0.92). VIF comes out at 8.2. The model can't tell apart the separate effects of income vs wealth. Solution: use only income, or create a composite "socioeconomic status" score. 😄 Tip: "If two variables always go up together in your data, your model has the same problem as a detective who always finds two suspects at the crime scene at the same time — it cannot tell who did it."
· · ·
E5
Problem 2
Heteroscedasticity — The Unequal Spreader
😄 Analogy: "Heteroscedasticity is like a group of students whose test scores vary wildly for rich students (some study hard, some don't) but are very consistent for poor students (all must study). The variance of the 'error' in predicting scores is not equal across income groups." This violates A4!
📡
What it is
Non-constant Error Variance
Heteroscedasticity means Var(εᵢ) = σᵢ² — the variance of the error term is NOT constant across observations. It changes with one or more predictors. Very common in cross-sectional data (individuals, firms, countries with very different sizes).
WLS objective: Minimise Σ wᵢeᵢ², where wᵢ = 1/σᵢ² (higher weight = more precise observation)
Breusch-Pagan: BP = n·R² ~ χ²(k), from regressing eᵢ²/σ̂² on all X's
White test stat: n·R² ~ χ²(p), where p = number of regressors in the auxiliary regression
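The nR² idea behind the Breusch-Pagan test can be sketched for a single regressor: regress the squared residuals on x and compute n times the auxiliary R². The residuals below are constructed (illustrative) so that eᵢ² grows exactly linearly with x:

```python
def bp_stat(x, resid):
    """Breusch-Pagan sketch: BP = n·R² from the auxiliary regression of eᵢ² on x."""
    e2 = [e * e for e in resid]
    n = len(x)
    xbar, ybar = sum(x) / n, sum(e2) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, e2))
    syy = sum((b - ybar) ** 2 for b in e2)
    r2 = (sxy * sxy) / (sxx * syy) if sxx * syy > 0 else 0.0
    return n * r2   # compare against χ²(k)

# squared residuals [1, 2, 3, 4] are perfectly linear in x ⇒ R² = 1 ⇒ BP = n = 4
bp = bp_stat([1, 2, 3, 4], [1.0, 2 ** 0.5, 3 ** 0.5, 2.0])
```

A large BP relative to the χ² critical value signals heteroscedasticity.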
🌍 Real Example: Regressing household food expenditure on income across 1000 Bangladeshi families. Rich families have very variable food spending (some eat lavishly, some save); poor families all spend similarly near subsistence. This creates a fan shape in residuals — classic heteroscedasticity. Remedy: use ln(expenditure) or WLS with weight 1/income².
· · ·
E6
Problem 3
Autocorrelation — The Time Traveller's Problem
😄 The Autocorrelation Joke: "Autocorrelation is like a gossip chain. What happened yesterday affects what people say today, which affects tomorrow. Errors in time series data are like rumours — yesterday's error whispers to today's error." When today's residual tells tomorrow's what to be, you have autocorrelation!
🔗
What it is
Correlated Error Terms
Autocorrelation (serial correlation) means Cov(εᵢ, εⱼ) ≠ 0 for i≠j — a violation of assumption A5. Most common in time series data (monthly GDP, daily stock prices, annual inflation). Positive autocorrelation is most common — errors persist in the same direction.
⚙️
Detection
Tests for Autocorrelation
Plot residuals over time: Look for cyclical or trending patterns
Durbin-Watson (DW) test: d ≈ 2 → no autocorrelation; d < 1.5 → positive AC; d > 2.5 → negative AC
Breusch-Godfrey (BG) test: More general — detects higher-order autocorrelation
Run test: Non-parametric test for randomness in residuals
⚠️
Consequences
What Goes Wrong
OLS estimates remain unbiased and consistent
But NOT BLUE — inefficient; larger variances than GLS
s² underestimates σ² → t & F tests give inflated significance
R² is overestimated — model looks better than it is!
💡
Remedies
Fixing Autocorrelation
Generalised Least Squares (GLS): Use the transformed model (most correct)
Cochrane-Orcutt method: Iterative GLS for AR(1) errors
Include lagged Y (Yₜ₋₁): Often removes autocorrelation
Newey-West HAC SEs: Robust SEs that account for autocorrelation
First-differencing: Use ΔY = Yₜ − Yₜ₋₁ as the dependent variable
AR(1) error process: εₜ = ρεₜ₋₁ + uₜ where |ρ| < 1 and uₜ ~ WN(0, σ²)
🌍 Bangladesh Example: Regressing annual rice production on fertiliser use and rainfall (1980–2023). The DW statistic = 1.12 signals positive autocorrelation — a good crop year tends to be followed by another good year (farmers reinvest; soil quality persists). Cochrane-Orcutt iteration gives ρ̂ = 0.48, and the corrected model gives more reliable coefficient estimates.
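A minimal numpy sketch of the Durbin-Watson diagnostic (simulated data with AR(1) errors; ρ = 0.7, the trend, and the seed are illustrative assumptions, not the rice-production series above):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = np.linspace(0, 10, n)

# Generate AR(1) errors: eps_t = 0.7*eps_{t-1} + u_t
u = rng.normal(0, 1, n)
eps = np.zeros(n)
for t in range(1, n):
    eps[t] = 0.7 * eps[t - 1] + u[t]
y = 1.0 + 0.5 * x + eps

# OLS fit, then Durbin-Watson statistic on the residuals
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)   # d ≈ 2(1 - rho_hat)
rho_hat = 1 - dw / 2                             # implied AR(1) coefficient
print(dw, rho_hat)
```

With ρ = 0.7 the DW statistic lands well below 1.5, flagging positive autocorrelation — the same pattern the rice-production example describes; ρ̂ = 1 − d/2 is the starting estimate Cochrane-Orcutt would iterate on.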
· · ·
E7
Model Misspecification
Specification Errors — Building the Wrong House
🏗️
What it is
Using the Wrong Model
Specification errors arise when the model is incorrectly specified — wrong variables, wrong functional form, or wrong structural assumptions. The most dangerous error in econometrics!
⚠️
Type 1: Omitted Variable
Leaving Out a Key Variable
True model: Y = β₁ + β₂X₂ + β₃X₃ + ε
Estimated model: Y = α₁ + α₂X₂ + u (X₃ omitted)
Result: OLS estimator of β₂ is biased and inconsistent
Bias direction depends on correlation between X₂ and X₃
😄 "Like measuring height but ignoring whether you're on a slope!"
⚠️
Type 2: Irrelevant Variable
Including an Unnecessary Variable
True model: Y = β₁ + β₂X₂ + ε
Estimated model: Y = α₁ + α₂X₂ + α₃X₃ + u (X₃ is irrelevant)
Result: OLS estimators remain unbiased but inefficient (larger variance)
R² increases artificially — use adjusted R² instead!
⚙️
Type 3: Wrong Functional Form
Linear When Non-linear
True: Y = β₁ + β₂X + β₃X² + ε (quadratic)
Fitted: Y = α₁ + α₂X + u (linear)
Residuals will show a curved pattern
RESET test (Ramsey) detects wrong functional form
💡
Detecting Specification Errors
Tests
RESET test: Add Ŷ², Ŷ³ to model; test their joint significance
Davidson-MacKinnon J-test: Test between non-nested models
RESET test: add Ŷ², Ŷ³ to the regression; F-test on their coefficients. Reject H₀ → misspecification.
🌍 Classic Example: Wage regression omitting "ability." Model: Wage = β₀ + β₁Education + ε. Problem: Ability affects both wages AND education choices. Omitting ability biases β̂₁ upward — we attribute to education some of what is really due to innate ability. This is the classic "ability bias" in returns to education. Solution: use IQ scores, sibling fixed effects, or instrumental variables (Angrist & Krueger's famous quarter-of-birth IV).
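The RESET procedure can be sketched in a few lines of numpy (simulated data; the quadratic true model, coefficients, and seed are illustrative assumptions): fit the misspecified linear model, augment it with Ŷ² and Ŷ³, and F-test the added terms.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(0, 5, n)
y = 1 + 2 * x + 1.5 * x**2 + rng.normal(0, 1, n)   # true model is quadratic

def ols_rss(X, y):
    """OLS fit; return coefficients and residual sum of squares."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b, np.sum((y - X @ b) ** 2)

X1 = np.column_stack([np.ones(n), x])               # misspecified linear fit
b1, rss1 = ols_rss(X1, y)
yhat = X1 @ b1

# RESET: augment with powers of the fitted values
X2 = np.column_stack([X1, yhat**2, yhat**3])
_, rss2 = ols_rss(X2, y)

q = 2                                               # number of added terms
F = ((rss1 - rss2) / q) / (rss2 / (n - X2.shape[1]))
print(F)
```

When the true relationship is quadratic but the fit is linear, the added Ŷ-power terms soak up the curvature, so the F statistic blows up far past the 5% critical value of F(2, 296) ≈ 3.0 and H₀ (correct specification) is rejected.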
· · ·
E8
Qualitative Predictors
Dummy Variables — Turning Categories into Numbers
💡 What is a Dummy Variable? A dummy (indicator) variable takes values 0 or 1 to represent a categorical characteristic. Male = 1, Female = 0. Urban = 1, Rural = 0. It's called "dummy" because it's a stand-in number for something that isn't naturally numeric. 😄 "It's not that the variable is stupid — it's just pretending to be a number!"
🔢
What it is
Binary Indicator Variables
For a qualitative variable with m categories, we include m−1 dummy variables (omit one — the "base" or "reference" category). Including all m dummies causes perfect multicollinearity — the dummy variable trap!
Interaction model: Wageᵢ = β₀ + β₁Dᵢ + β₂Educᵢ + β₃(Dᵢ·Educᵢ) + εᵢ, with D = 1 for male
β₃ allows the slope of education to differ by gender
Male return to education: β₂ + β₃
Female return to education: β₂
This is the Chow test idea — testing if two groups have different regression relationships
⚠️
Dummy Trap
The Most Common Mistake!
For m categories, ALWAYS include m−1 dummies. If you include all m, the sum of all dummies = 1 (a constant) which creates PERFECT multicollinearity. Example: if you have MALE and FEMALE dummies, they always sum to 1 = the intercept column → perfect collinearity. Drop one! The dropped category is the "reference group."
General form: Yᵢ = β₀ + β₁Dᵢ + β₂Xᵢ + εᵢ (D = 1 for group A, D = 0 for group B)
Group A mean: E(Yᵢ|Dᵢ=1, Xᵢ) = (β₀+β₁) + β₂Xᵢ (shifted intercept)
Group B mean: E(Yᵢ|Dᵢ=0, Xᵢ) = β₀ + β₂Xᵢ (reference group)
Chow test F-stat: F = [(SSEᵣ − (SSE₁+SSE₂))/k] / [(SSE₁+SSE₂)/(n₁+n₂−2k)]
🌍 Bangladesh Policy Example: Evaluating the impact of a microfinance program: Treatment = 1 (received loan), Control = 0. Model: Income = β₀ + β₁·Treatment + β₂·Education + β₃·Age + ε. β₁ estimates the Average Treatment Effect (ATE) — did the loan raise income? If β₁ = 2500 (significant), the program raises income by Tk 2500 holding other factors fixed. This is the basis of impact evaluation / program evaluation in development economics!
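A minimal numpy sketch of a dummy-variable regression and the dummy trap (simulated data; the true effect of 2500, the other coefficients, and the seed are illustrative assumptions echoing the example above):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
treat = rng.integers(0, 2, n)                 # D = 1 treated, 0 control
educ = rng.uniform(5, 16, n)
income = 10000 + 2500 * treat + 800 * educ + rng.normal(0, 1000, n)

# Correct design: intercept + ONE dummy (m-1 dummies for m=2 categories)
X = np.column_stack([np.ones(n), treat, educ])
beta, *_ = np.linalg.lstsq(X, income, rcond=None)
print(beta[1])                                # estimated treatment effect

# The dummy trap: including BOTH group dummies with an intercept
# makes the columns linearly dependent (treat + (1-treat) = 1)
X_trap = np.column_stack([np.ones(n), treat, 1 - treat, educ])
print(np.linalg.matrix_rank(X_trap))          # rank 3 < 4 columns
```

The rank deficiency in `X_trap` is exactly the perfect multicollinearity the trap warning describes — X'X is singular and OLS has no unique solution, which is why one dummy must be dropped as the reference group.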
· · ·
E9
Advanced
Simultaneous Equation Models — Cause and Effect in Both Directions
🔄
What it is
Bidirectional Causality
In many economic situations, variables determine each other simultaneously. Supply & demand: price determines quantity demanded AND quantity supplied determines price. This simultaneity causes OLS to be biased and inconsistent — the "simultaneity bias."
⚙️
Endogenous vs Exogenous
Variable Classification
Endogenous (jointly determined): Price & Quantity in supply-demand system
Structural form: The economic behavioural equations
Reduced form: Each endogenous variable expressed only in terms of exogenous variables
💡
Identification Problem
Can We Estimate the Equations?
Under-identified: Cannot estimate from data alone
Exactly identified: Unique estimates possible
Over-identified: Multiple estimates possible; use 2SLS
Order condition: (K−k) ≥ (m−1) where K=total exogenous, k=exogenous in equation, m=endogenous in equation
🔢
Estimation Methods
How to Estimate
ILS (Indirect Least Squares): For exactly identified equations
2SLS (Two-Stage Least Squares): Most popular for over-identified. Stage 1: regress endogenous X on instruments; Stage 2: use fitted X̂ in main regression
2SLS Stage 1: Regress P on ALL exogenous variables → get P̂
2SLS Stage 2: Replace P with P̂ in the structural equation → OLS gives consistent estimates
😄 Why OLS Fails Here: "Using OLS for a simultaneous system is like trying to figure out who started a fight when both parties hit each other at exactly the same time — you can't tell cause from effect!" Price rises → quantity supplied rises (supply); but quantity demanded falls → price falls (demand). OLS blends these two directions and gives wrong answers for both. 2SLS untangles them using instruments.
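The two 2SLS stages can be sketched with a toy endogenous regressor (simulated data; the structural coefficient of 2, the instrument strength, and the seed are illustrative assumptions, not a full supply-demand system):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
z = rng.normal(0, 1, n)                        # instrument (exogenous)
u = rng.normal(0, 1, n)                        # structural error
x = z + 0.8 * u + rng.normal(0, 1, n)          # endogenous: Cov(x, u) > 0
y = 2.0 * x + u                                # true beta = 2

# OLS (no intercept, data are mean-zero): biased upward by the x-u correlation
b_ols = np.sum(x * y) / np.sum(x * x)

# Stage 1: regress x on the instrument z -> fitted x_hat
xhat = z * (np.sum(z * x) / np.sum(z * z))
# Stage 2: regress y on x_hat (algebraically equal to the IV estimator
# sum(z*y)/sum(z*x))
b_2sls = np.sum(xhat * y) / np.sum(xhat * xhat)
print(b_ols, b_2sls)
```

On this design OLS converges to roughly 2.3 (simultaneity bias), while 2SLS recovers the true coefficient of 2 — the instrument isolates the variation in x that is uncorrelated with u.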
· · ·
E10
Time Series
Time Series Econometrics — Stationarity, Unit Roots & Cointegration
⚡ The Spurious Regression Warning! Regressing one non-stationary time series on another can give a high R² and significant t-statistics PURELY BY CHANCE — even if they have nothing to do with each other. Example: Bangladesh rice production and global smartphone sales both trend upward → regressing one on the other gives R²=0.94 but it is COMPLETELY MEANINGLESS. Always test for stationarity first!
📈
Stationarity
The Key Concept in Time Series
A time series is weakly stationary if its mean, variance, and autocovariances are constant over time (don't depend on t). Most economic time series (GDP, prices, exchange rates) are NON-stationary — they have trends and drifts.
⚙️
Unit Root Tests
Testing for Non-stationarity
Augmented Dickey-Fuller (ADF) test: H₀: series has unit root (non-stationary); Reject H₀ → stationary. The most widely used test.
Phillips-Perron (PP) test: Non-parametric correction for serial correlation
KPSS test: H₀: stationary (opposite hypothesis — use alongside ADF)
💡
Cointegration
Long-Run Equilibrium
Two non-stationary I(1) series are cointegrated if their linear combination is stationary I(0). They share a long-run equilibrium relationship. Use Engle-Granger two-step method or Johansen test. If cointegrated: use Error Correction Model (ECM).
🌍 Bangladesh Application: Testing whether the taka-dollar exchange rate and domestic price level are cointegrated (Purchasing Power Parity). Both series are I(1). Engle-Granger test finds cointegration — a long-run PPP relationship holds. Estimate ECM: the speed-of-adjustment coefficient γ̂ = −0.23 means 23% of any deviation from long-run PPP is corrected each quarter. Highly useful for monetary policy!
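The spurious-regression warning above is easy to demonstrate: regress two independent random walks on each other in levels, then in first differences (simulated data; lengths and seed are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
walk1 = np.cumsum(rng.normal(0, 1, n))   # I(1): pure random walk
walk2 = np.cumsum(rng.normal(0, 1, n))   # an INDEPENDENT random walk

def r2(x, y):
    """R-squared from OLS of y on a constant and x."""
    X = np.column_stack([np.ones(len(x)), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - np.sum(resid**2) / np.sum((y - y.mean()) ** 2)

r2_level = r2(walk1, walk2)              # can be deceptively high by chance
r2_diff = r2(np.diff(walk1), np.diff(walk2))  # on stationary differences
print(r2_level, r2_diff)
```

In first differences both series are stationary white noise and R² collapses toward zero, exposing the levels relationship as spurious — this is why testing for unit roots (and differencing, or modelling cointegration) must come before regression on trending series.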
🎓 Why Multivariate Analysis?
"In real life, nothing happens in isolation." Blood pressure AND cholesterol AND BMI together predict heart disease — not one alone. Multivariate analysis handles p variables simultaneously, capturing their joint distributions, correlations, and interactions. As Johnson & Wichern put it: "Most data sets encountered in practice contain measurements on several variables that must be analyzed jointly." The key advantage: we preserve the covariance structure that gets lost when analyzing variables one at a time. 😄 Joke: "A univariate statistician sees a forest of trees. A multivariate statistician sees the forest, the ecosystem, the relationships between trees, AND the soil composition — all at once!"
Multivariate Analysis (MVA) refers to statistical techniques for analysing data with p ≥ 2 variables measured on each observation. Goal: understand the joint behaviour, interdependencies, and structure of these variables simultaneously — not one at a time.
✅
Applications
Where MVA is Used
Medical: Joint analysis of blood pressure, cholesterol, BMI, age for heart disease risk
Ecology: Species abundance across multiple environmental variables
Finance: Portfolio of stocks — returns, risks, correlations simultaneously
Agriculture: Crop yield as function of soil, rain, temperature, fertiliser jointly
💡
Key Concept
The Data Matrix
MVA operates on an n × p data matrix X: n observations (rows), p variables (columns). Each row is a p-dimensional observation vector xᵢ = (x_{i1}, x_{i2}, …, x_{ip})'. The entire dataset is the matrix X of dimension n×p.
⚠️
When NOT to Use
Limitations & Cautions
Requires multivariate normality for many classical methods — always check!
Highly sensitive to outliers — a single bad row can distort everything
Sample size n must be >> p (as a rule: n ≥ 5p minimum)
Interpretation becomes very challenging as p grows large ("curse of dimensionality")
😄 The "Curse of Dimensionality" Joke: "In 1D you need 10 points to understand a distribution. In 10D you need 10¹⁰ points — more than the world's population. This is why every multivariate statistician is simultaneously excited about p variables and terrified of having too many." — The curse is real, and MVA is largely about fighting it!
· · ·
M2
Distance Measures
Euclidean & Statistical Distance
📏
Euclidean Distance
Ordinary Geometric Distance
The familiar straight-line distance between two points x and y in p-dimensional space: d(x,y) = √[Σᵢ(xᵢ−yᵢ)²]. Simple but has a critical flaw: it treats all variables equally regardless of their scale or correlation. A variable measured in kilometres swamps one measured in centimetres!
🎯
Mahalanobis Distance
Statistical Distance — The MVP
Mahalanobis distance accounts for the scale AND correlation structure of the data via the covariance matrix Σ: d²(x,μ) = (x−μ)'Σ⁻¹(x−μ). It's unit-free and correlation-corrected. Think of it as Euclidean distance in "standardised space" rotated to remove correlations.
💡
Why Mahalanobis?
Advantages Over Euclidean
Scale-invariant — variables on different units treated fairly
Accounts for correlations — correlated variables don't double-count
Identifies multivariate outliers — points far from the centroid in σ units
d²(x,μ) ~ χ²(p) under multivariate normality — useful for outlier detection!
🌍
Real Example
Medical Diagnosis
Patient has systolic BP=140mmHg and age=45 years. Euclidean distance from population mean (120mmHg, 40yrs) = √(20²+5²) = 20.6. But BP and age have different scales AND are correlated. Mahalanobis distance gives a meaningful "how unusual is this patient" measure corrected for both scale and the BP-age correlation.
Sample version: d²(xᵢ, x̄) = (xᵢ−x̄)' S⁻¹ (xᵢ−x̄) ~ χ²(p) approximately
Outlier threshold: flag xᵢ as an outlier if d²(xᵢ, x̄) > χ²₀.₉₇₅(p)
😄 Distance Analogy: "Euclidean distance measures 'as the crow flies.' Mahalanobis distance measures 'as the statistician walks' — taking into account the terrain (correlations) and the different scales of measurement (variances). They're both right, but one is much smarter about context."
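The BP/age patient example can be computed directly. A minimal numpy sketch — the covariance matrix below is an illustrative assumption (variances and BP-age covariance invented for the demo), only the means (120, 40) and the patient (140, 45) come from the example:

```python
import numpy as np

mu = np.array([120.0, 40.0])             # population mean: BP (mmHg), age (yrs)
Sigma = np.array([[225.0,  90.0],        # assumed covariance: Var(BP)=225,
                  [ 90.0, 100.0]])       # Var(age)=100, Cov(BP, age)=90
x = np.array([140.0, 45.0])              # the patient

d_euclid = np.linalg.norm(x - mu)                      # sqrt(20^2 + 5^2)
d2_mahal = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)  # (x-mu)' S^-1 (x-mu)
print(d_euclid, d2_mahal)
```

Euclidean distance is about 20.6 (dominated by the BP scale), while the squared Mahalanobis distance is far below the outlier threshold χ²₀.₉₇₅(2) ≈ 7.38 — under this assumed covariance the patient is not unusual once scale and correlation are accounted for.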
Every symmetric positive definite matrix A can be decomposed as: A = PΛP' where P = matrix of eigenvectors (orthonormal columns) and Λ = diagonal matrix of eigenvalues λ₁ ≥ λ₂ ≥ … ≥ λₚ > 0. The eigenvectors give the "principal directions" of the data; eigenvalues give the "lengths" in those directions. Foundation of PCA!
🔺
Cholesky Decomposition
Lower-Triangular Factorisation
Every positive definite matrix Σ can be written as Σ = LL' where L is a lower-triangular matrix with positive diagonal entries. Why useful? (1) Simulate multivariate normal data: if Z~N(0,I), then X = μ + LZ ~ N(μ,Σ). (2) Solve linear systems efficiently. (3) Check positive definiteness — Cholesky fails if Σ is not positive definite.
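Use (1) above — simulating multivariate normal data — can be sketched in numpy (the mean vector, covariance matrix, sample size, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

L = np.linalg.cholesky(Sigma)        # lower-triangular L with Sigma = L L'
Z = rng.normal(size=(100_000, 2))    # Z ~ N(0, I)
X = mu + Z @ L.T                     # X = mu + L Z ~ N(mu, Sigma)

S = np.cov(X, rowvar=False)          # sample covariance recovers Sigma
print(S)
```

`np.linalg.cholesky` raises `LinAlgError` if the matrix is not positive definite — which is exactly use (3), checking positive definiteness in practice.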
💡
Square Root of Matrix
Matrix Square Root A^(1/2)
Using spectral decomposition: A^(1/2) = PΛ^(1/2)P' where Λ^(1/2) = diag(√λ₁, …, √λₚ). Property: A^(1/2) · A^(1/2) = A. Used to transform data to uncorrelated form: if X ~ Nₚ(μ,Σ), then Σ^(-1/2)(X−μ) ~ Nₚ(0,I) — the "sphering" or "whitening" transformation essential for many multivariate tests.
🔢
Partitioned Covariance
Block Structure of Σ
Partition the p-vector x = (x₍₁₎', x₍₂₎')' into groups of p₁ and p₂ variables. Then Σ = [[Σ₁₁, Σ₁₂],[Σ₂₁, Σ₂₂]] where Σ₁₁=Var(x₍₁₎), Σ₂₂=Var(x₍₂₎), Σ₁₂=Cov(x₍₁₎,x₍₂₎). Used in canonical correlation, conditional distributions, and regression of one group on another.
😄 Matrix Square Root Joke: "Why can't a matrix go to therapy alone? Because it needs its square root to become 'whole' — and its inverse to undo its past mistakes!" More seriously: the matrix square root is what lets us transform any multivariate normal distribution into a standard one, making everything else tractable.
· · ·
M4
Variation in p Dimensions
Covariance Matrix & Generalised Variance
📊
Covariance Matrix Σ
The Multivariate Analogue of Variance
For a p-dimensional random vector X, the covariance matrix Σ (p×p) captures ALL pairwise variances and covariances: σᵢᵢ = Var(Xᵢ) on diagonal; σᵢⱼ = Cov(Xᵢ,Xⱼ) off-diagonal. Σ is symmetric and positive (semi)definite. The sample version S = (n−1)⁻¹Σᵢ(xᵢ−x̄)(xᵢ−x̄)' is the unbiased estimator.
🔢
Generalised Variance
|Σ| — One Number for All Variation
The determinant |Σ| is called the generalised variance — it summarises the total variation in all p variables in a single number. Geometrically: |Σ| is proportional to the squared volume of the p-dimensional ellipsoid formed by the data. |Σ| = 0 means variables are perfectly linearly dependent (degenerate distribution).
💡
Total Variation
Trace of Σ — Alternative Summary
tr(Σ) = σ₁₁ + σ₂₂ + … + σₚₚ = sum of all variances. This is the "total variance" measure. tr(Σ) = Σλᵢ (sum of eigenvalues). Used in PCA: proportion of variance explained by kth PC = λₖ/tr(Σ). Both |Σ| and tr(Σ) are used as scalar measures of multivariate scatter.
🌍
Correlation Matrix
Standardised Version
R = D^(-1/2) Σ D^(-1/2) where D = diag(σ₁₁,…,σₚₚ). All diagonal entries of R = 1; off-diagonal rᵢⱼ ∈ [−1,1]. Working with R (instead of Σ) is equivalent to standardising all variables to unit variance. Most MVA methods can work with either Σ or R — the choice matters for interpretation!
Population Σ: Σ = E[(X−μ)(X−μ)'] (p×p symmetric positive definite)
Sample S: S = (1/(n−1)) · Σᵢ(xᵢ−x̄)(xᵢ−x̄)'
Generalised variance: |S| = det(S) (volume of the data ellipsoid)
Total variance: tr(S) = s₁₁ + s₂₂ + … + sₚₚ = Σᵢ λᵢ
Correlation matrix: R = D^(−1/2) S D^(−1/2) (D = diagonal matrix of variances)
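The formulas above can be checked on a small simulated data matrix (n, p, and the seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(50, 3))              # n=50 observations, p=3 variables

S = np.cov(X, rowvar=False)               # unbiased sample covariance (n-1)
gen_var = np.linalg.det(S)                # generalised variance |S|
total_var = np.trace(S)                   # total variance tr(S)

# Correlation matrix R = D^(-1/2) S D^(-1/2)
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(S)))
R = D_inv_sqrt @ S @ D_inv_sqrt
print(gen_var, total_var)
```

Two of the stated identities fall out immediately: tr(S) equals the sum of the eigenvalues of S, and every diagonal entry of R is exactly 1.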
· · ·
M5
Core Distribution
The Multivariate Normal Distribution
🔔
Definition & Meaning
Nₚ(μ, Σ)
A p-dimensional random vector X follows a multivariate normal distribution Nₚ(μ,Σ) if every linear combination a'X is (univariate) normal for any non-zero vector a. Parameters: mean vector μ (p×1) — location; covariance matrix Σ (p×p) — shape and spread. The MVN is completely characterised by just these two parameters!
📐
Properties
Key Properties of MVN
Marginals are normal: Each Xᵢ ~ N(μᵢ, σᵢᵢ)
Conditionals are normal: (X₁|X₂=x₂) ~ N(μ₁.₂, Σ₁₁.₂)
Linear combinations: AX+b ~ N(Aμ+b, AΣA')
Uncorrelated → Independent: UNIQUE to MVN! If Cov(Xᵢ,Xⱼ)=0 then Xᵢ⊥Xⱼ
Quadratic forms: (X−μ)'Σ⁻¹(X−μ) ~ χ²(p)
💡
Contours & Geometry
Elliptical Contours
Contours of constant density for MVN are ellipsoids in p-dimensional space: {x : (x−μ)'Σ⁻¹(x−μ) = c²}. The shape/orientation is determined by Σ. Axes of the ellipse = eigenvectors of Σ; lengths proportional to √λᵢ. In 2D: a tilted ellipse if variables are correlated, circles if uncorrelated.
⚠️
Important Caution
Marginals Normal ≠ Joint Normal
Each variable being normally distributed does NOT imply joint multivariate normality! A classic counterexample: X~N(0,1) and Y = X if |X|>1, Y = −X otherwise. Then X~N, Y~N but (X,Y) is NOT bivariate normal. Always test joint normality, not just marginals!
Bivariate Normal Contours — Different Correlation Structures
· · ·
M6
Estimation
MLE of Mean Vector & Covariance Matrix
🎯
MLE of μ
Sample Mean Vector
The MLE of the mean vector μ is simply the sample mean vector x̄ = (1/n)Σᵢxᵢ. It is unbiased (E(x̄) = μ) and its sampling distribution is x̄ ~ Nₚ(μ, Σ/n). Larger n → smaller variance of x̄ → more precise estimate. Intuition: just average each variable separately.
⚙️
MLE of Σ
MLE vs Unbiased Estimator
MLE: Σ̂ = (1/n)Σᵢ(xᵢ−x̄)(xᵢ−x̄)' — biased (uses n, not n−1)
Unbiased S: S = (1/(n−1))Σᵢ(xᵢ−x̄)(xᵢ−x̄)' — used in practice
MLE is biased by factor (n−1)/n — for large n, difference negligible
Both are consistent estimators (converge to Σ as n→∞)
💡
Sufficiency
Sufficient Statistics for MVN
For MVN data, (x̄, S) is a jointly sufficient statistic for (μ, Σ) — meaning all information in the sample about the parameters is captured by the sample mean vector and sample covariance matrix. No other summary can add more information. This is the multivariate analogue of the fact that (x̄, s²) is sufficient for (μ,σ²) in univariate normal.
📈
Large Sample Behaviour
Asymptotic Results
√n(x̄ − μ) → Nₚ(0, Σ) as n→∞ (multivariate CLT)
n·(x̄−μ)'S⁻¹(x̄−μ) → χ²(p) as n→∞
S → Σ in probability (consistency)
These are the basis for large-sample inference about μ
MLE of μ: μ̂ = x̄ = (1/n) Σᵢ xᵢ (unbiased)
MLE of Σ (biased): Σ̂ = (1/n) Σᵢ(xᵢ−x̄)(xᵢ−x̄)'
Unbiased S: S = (1/(n−1)) Σᵢ(xᵢ−x̄)(xᵢ−x̄)' (used in tests)
Distribution of x̄: x̄ ~ Nₚ(μ, Σ/n)
Multivariate CLT: √n(x̄ − μ) →_d Nₚ(0, Σ) as n→∞
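The exact (n−1)/n relationship between the two estimators of Σ is easy to verify numerically (simulated data; n, p, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(20, 2))
n = X.shape[0]

xbar = X.mean(axis=0)                         # MLE of mu
centered = X - xbar
S_mle = centered.T @ centered / n             # MLE of Sigma (biased)
S_unb = centered.T @ centered / (n - 1)       # unbiased S

# The two differ by exactly the factor (n-1)/n
print(S_mle / S_unb)
```

For n = 20 the bias factor is 19/20 = 0.95 — already close to 1, illustrating why the difference is negligible for large n.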
· · ·
M7
Diagnostics
Assessing Multivariate Normality
🔬
Step 1: Marginal Checks
Univariate Marginal Normality
Plot histogram and Q-Q plot for each variable separately
Shapiro-Wilk or Kolmogorov-Smirnov test for each Xⱼ
Check for skewness and kurtosis near 0 and 3 respectively
Warning: All marginals normal ≠ joint MVN! This is necessary but NOT sufficient
😄 Transformation Tip: "Transforming data to normality is like ironing a wrinkled shirt — the content (information) doesn't change, but the shape becomes much more manageable. The Box-Cox transformation is like an automatic iron that figures out the right temperature (λ) by itself!" Remember to always report which transformation was used so results can be back-transformed for interpretation.
· · ·
M8
Sampling Theory
Wishart Distribution & Sampling Distributions
📐
Wishart Distribution
Multivariate Analogue of χ²
If X₁,…,Xₙ are iid Nₚ(0,Σ), then the matrix W = Σᵢ XᵢXᵢ' ~ Wₚ(n,Σ) follows a Wishart distribution with n degrees of freedom and scale matrix Σ. The sample covariance matrix satisfies: (n−1)S ~ Wₚ(n−1,Σ). It is the matrix generalisation of the chi-square distribution — just as s² has a chi-square distribution in univariate normal, S has a Wishart distribution!
⚙️
Properties of Wishart
Key Facts
E(W) = nΣ — so E(S) = Σ (unbiased)
If p=1: W reduces to σ²χ²(n) — the familiar univariate result
🎯
Hotelling's T²
One-Sample Test on the Mean Vector
Tests H₀: μ = μ₀ (the mean vector equals a specified vector). Hotelling's T² = n(x̄−μ₀)'S⁻¹(x̄−μ₀). This is the multivariate generalisation of the one-sample t-test. Under H₀: [(n−p)/p(n−1)]·T² ~ Fₚ,ₙ₋ₚ. Reject H₀ if this exceeds F_α(p, n−p). The TWO-SAMPLE version tests H₀: μ₁ = μ₂ using the pooled covariance matrix.
📊
MANOVA
Multivariate ANOVA
MANOVA tests whether group mean vectors are equal: H₀: μ₁=μ₂=…=μg. Decomposes the total scatter matrix T into: T = H + E where H=between-group (hypothesis) matrix and E=within-group (error) matrix. Tests use functions of H and E — primarily Wilks' Lambda Λ = |E|/|H+E|.
💡
MANOVA Test Statistics
Four Equivalent Tests
Wilks' Lambda: Λ = |E|/|T| — most widely used
Pillai's Trace: tr(H(H+E)⁻¹)
Hotelling-Lawley Trace: tr(HE⁻¹)
Roy's Largest Root: λ₁/(1+λ₁) — most powerful for single-direction alternatives
All four equivalent in large samples; differ for small n or specific alternatives
⚠️
MANOVA Assumptions
Requirements
Multivariate normality within each group
Homogeneity of covariance matrices: Σ₁=Σ₂=…=Σg (Box's M test)
Independence of observations
n > p (more obs than variables — essential!)
⚠ If assumptions violated → use permutation MANOVA (vegan package)
Hotelling T²: T² = n(x̄−μ₀)'S⁻¹(x̄−μ₀)
T² to F: F = [(n−p)/p(n−1)] · T² ~ Fₚ,ₙ₋ₚ under H₀
MANOVA decomposition: T = H + E (Total = Between + Within)
😄 MANOVA Analogy: "MANOVA is like ANOVA but instead of asking 'do these groups have different means on ONE measure?' it asks 'do these groups differ on ANY combination of ALL measures simultaneously?' It's like comparing entire personality profiles rather than just one trait. Much more powerful when variables are correlated!" — And Wilks' Lambda is like the p-value's sophisticated older sibling who considers the whole picture.
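The Hotelling T² formula and its F conversion can be sketched in numpy (simulated data; the true mean shift of 0.5 in the first coordinate, n, p, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(10)
n, p = 40, 3
X = rng.normal(loc=[0.5, 0.0, 0.0], size=(n, p))   # true mean != mu0

mu0 = np.zeros(p)                                   # H0: mu = 0
xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)

# Hotelling's T^2 and its exact-F conversion
T2 = n * (xbar - mu0) @ np.linalg.inv(S) @ (xbar - mu0)
F = (n - p) / (p * (n - 1)) * T2                    # ~ F(p, n-p) under H0
print(T2, F)
```

Compare `F` with the upper-α quantile of F(3, 37) to decide; note the multiplier (n−p)/(p(n−1)) is below 1, so F is always smaller than T².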
· · ·
M10
Prediction
Multivariate Multiple Regression
📉
What it is
Multiple Y, Multiple X
Multivariate multiple regression has multiple response variables Y (n×m matrix) AND multiple predictors X (n×(k+1) matrix). Model: Y = XB + E where B (k+1)×m is the coefficient matrix and E is n×m error matrix. Each column of Y is a separate response; they share the same predictors X.
⚙️
Estimation
Matrix OLS
OLS estimator: B̂ = (X'X)⁻¹X'Y. Each column of B̂ is the OLS solution for that response variable separately — so multivariate regression is equivalent to running m separate univariate regressions! However, joint analysis is more efficient and enables tests involving ALL responses simultaneously.
💡
Why Use Jointly?
Advantage of Joint Analysis
Tests on coefficient matrix B involving multiple responses simultaneously
Accounts for correlations among response variables → more powerful tests
Can test hypotheses of form CBM = 0 (general linear hypothesis)
Residual covariance matrix Ê'Ê/(n−k−1) estimates Σ — the cross-response correlations
Model (matrix): Y(n×m) = X(n×(k+1)) · B((k+1)×m) + E(n×m)
OLS estimator: B̂ = (X'X)⁻¹X'Y
Residual matrix: Ê = Y − XB̂ = (I − H)Y where H = X(X'X)⁻¹X'
Error covariance estimate: Σ̂ = Ê'Ê/(n−k−1)
General hypothesis: H₀: CBM = 0 → test via Wilks' Λ or Hotelling trace
🎓 The Big Picture of MVA II
Where Multivariate I asked "how are variables distributed and how do we test hypotheses about means?", Multivariate II asks "what structure is hidden in the data?" PCA finds orthogonal dimensions of maximum variance. Factor Analysis finds latent constructs driving correlations. Cluster Analysis groups similar observations. Discriminant Analysis builds rules to classify new observations. Together, these are the core of unsupervised and supervised multivariate learning. 😄 "MVA II is where statistics starts looking suspiciously like machine learning — because it basically is!"
😄 PCA Analogy: "PCA is like finding the best angle to photograph a 3D sculpture so it reveals the most information in a 2D photo. You rotate your perspective to capture maximum variance in each new direction — the first principal component is the angle with the best overall view, the second adds what the first missed, and so on!" Each photo (PC) is orthogonal to the others.
📊
What it is
Finding Maximum Variance Directions
PCA transforms p correlated variables into p uncorrelated Principal Components (PCs) that are linear combinations of the originals. PC1 captures maximum variance; PC2 captures maximum of remaining variance orthogonal to PC1; and so on. Goal: represent data in fewer dimensions with minimal information loss.
⚙️
How PCA Works
The Eigenvalue Approach
Compute S (or R for standardised PCA)
Find eigenvalues λ₁≥λ₂≥…≥λₚ and eigenvectors e₁,e₂,…,eₚ of S
ith PC: Yᵢ = eᵢ'X (linear combination with eigenvector weights)
Var(Yᵢ) = λᵢ; Cov(Yᵢ,Yⱼ) = 0 for i≠j
Retain k PCs where Σᵢ₌₁ᵏ λᵢ/tr(S) ≥ 0.80 (80% variance rule)
💡
Choosing # of PCs
How Many to Keep?
80% variance rule: Keep enough PCs to explain ≥80% of total variance
Scree plot: Plot λᵢ vs i; look for "elbow" — PCs before the bend
Kaiser criterion: Keep PCs with λᵢ > 1 (from R, not S)
Loadings: lᵢⱼ = eᵢⱼ · √λᵢ (scaled correlation between PC i and variable j)
Communality: hⱼ² = Σᵢ lᵢⱼ² (variance of Xⱼ explained by the retained PCs)
🌍 Real Application: Socioeconomic Index — Bangladesh district data: 8 variables (income, education, health access, sanitation, literacy, employment, poverty rate, infrastructure). PCA extracts PC1 (accounts for 62% variance) which has high positive loadings on income, education, infrastructure and negative loading on poverty — this is a "development index" that can rank districts. Avoids multicollinearity issues in regression by replacing 8 correlated variables with 2-3 orthogonal PCs.
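The eigenvalue recipe above can be sketched end-to-end in numpy (simulated data with two strongly correlated variables plus one noise variable; the construction and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(11)
z = rng.normal(size=500)
# First two columns share a common signal; third is independent noise
X = np.column_stack([z + 0.1 * rng.normal(size=500),
                     2 * z + 0.1 * rng.normal(size=500),
                     rng.normal(size=500)])

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)        # eigh returns ascending order
order = np.argsort(eigvals)[::-1]           # sort descending: lambda1 >= ...
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()         # proportion lambda_k / tr(S)
scores = (X - X.mean(axis=0)) @ eigvecs     # PC scores Y_i = e_i' x
print(explained)
```

On this data PC1 alone clears the 80% variance rule, and the PC scores are exactly uncorrelated — their covariance matrix is diagonal with the eigenvalues on the diagonal.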
· · ·
A2
Signal Separation
Independent Component Analysis (ICA)
🎵
What it is
Beyond Uncorrelated — Finding Independence
ICA decomposes X = AS + noise where S are statistically independent source signals and A is the mixing matrix. Goal: estimate A and recover S. Unlike PCA (which finds uncorrelated components), ICA finds components that are statistically independent — a much stronger condition. Identifiability requires non-Gaussian sources: at most one source may be Gaussian.
🔊
The Cocktail Party Problem
Classic Motivation
Imagine p microphones recording a party with p speakers talking simultaneously. Each microphone records a mixture of all voices. ICA recovers the individual voices (independent sources) from the mixed recordings. Applications: EEG/fMRI brain signal separation, audio source separation, financial return decomposition, image processing.
😄 Factor Analysis Analogy: "Factor analysis is like figuring out that what students score on reading, writing, and comprehension tests is really driven by a single underlying construct: 'verbal intelligence.' You can't directly measure verbal intelligence, but you can observe its effects on multiple tests. Factor analysis extracts these invisible factors that drive the observable correlations." — Widely used in psychology, social science, and education.
🔍
What it is
Latent Factor Model
Factor Analysis (FA) models p observed variables as linear combinations of m << p latent (unobservable) common factors F plus unique factors: X = μ + LF + ε. L is the (p×m) loading matrix; F are m common factors; ε are p unique (specific) factors. Goal: interpret the common factors as meaningful latent constructs.
⚙️
FA vs PCA
Critical Differences
PCA: Explains total variance; components are explicit linear combos of X; descriptive
FA: Explains common variance only (not unique/error variance); factors are latent unobservables; model-based
PCA: Unique solution; components are ordered by variance
FA: Solution not unique — rotation can be applied to improve interpretability!
💡
Factor Rotation
Making Factors Interpretable
Orthogonal rotation (Varimax): Maximises variance of squared loadings per column — produces "simple structure" where each variable loads highly on one factor and near-zero on others. Factors remain uncorrelated.
Oblique rotation (Promax, Oblimin): Allows factors to be correlated — more realistic when latent constructs are related (e.g., verbal and mathematical intelligence are correlated)
⚠️
Conditions
When FA is Appropriate
✅ Variables are correlated (|R|<1) — if uncorrelated, no common factors exist
✅ You believe latent constructs drive the correlations (theory-driven)
✅ Communalities h² should be reasonable — if all h²≈0, model fails
❌ Don't use FA when all variance is unique — use PCA instead
⚠ Factor identification requires subjective interpretation — what does Factor 1 "mean"?
Communality: hⱼ² = Σₖ lⱼₖ² (proportion of Var(Xⱼ) explained by the common factors)
Uniqueness: ψⱼ = 1 − hⱼ² (proportion unexplained by the common factors)
Factor scores (regression method): F̂ = L'Σ⁻¹(X−μ); Bartlett's method: F̂ = (L'Ψ⁻¹L)⁻¹L'Ψ⁻¹(X−μ)
🌍 Bangladesh Application: Poverty Index — 10 district-level variables measured: income, education, sanitation, health access, child mortality, malnutrition, electricity, road access, drinking water quality, school enrolment. FA extracts 3 factors: Factor 1 (high loadings on income, electricity, roads) = "Infrastructure & Economy"; Factor 2 (health access, child mortality, malnutrition) = "Health Status"; Factor 3 (education, school enrolment) = "Human Capital". These factors become inputs to a multidimensional poverty index. Much more interpretable than raw 10-variable data!
· · ·
A4
Unsupervised Grouping
Cluster Analysis — Finding Natural Groups
😄 Clustering Joke: "Cluster analysis is what you do when you have data but no one told you what groups exist. It's like showing up at a party where you know nobody — after a while you notice people naturally cluster by interest, age group, or how loudly they speak. Cluster analysis does this mathematically, without you having to mingle!" The key challenge: you don't know the 'right' answer — there's no objective truth in unsupervised learning.
🗂️
What it is
Grouping Without Labels
Cluster analysis partitions n observations into g groups (clusters) such that observations within a cluster are similar and observations between clusters are dissimilar. It is unsupervised — no predefined groups or labels. Goal: discover natural structure in data.
⚙️
Hierarchical Clustering
Building a Dendrogram
Agglomerative (bottom-up): Start with n clusters (each obs = 1 cluster); merge closest pair; repeat until all in 1 cluster. Most common.
Divisive (top-down): Start with 1 cluster; split recursively
Linkage methods: Single (minimum distance), Complete (maximum), Average (UPGMA), Ward's (minimise within-cluster variance)
Result: Dendrogram — cut at desired level to get g clusters
💡
K-Means Clustering
Iterative Partitioning
Specify k (number of clusters) in advance
Algorithm: (1) Assign each obs to nearest centroid; (2) Update centroids as cluster means; (3) Repeat until convergence
Minimises: Σₖ Σᵢ∈Cₖ ‖xᵢ − μₖ‖² (within-cluster sum of squares)
Sensitive to initial centroids — run multiple times with random starts
Choosing k: Elbow plot, Silhouette coefficient, Gap statistic
⚠️
Conditions & Cautions
When Each Method Works
✅ Hierarchical: Small-medium n; want to see ALL possible groupings; no need to prespecify k
✅ K-means: Large n; approximately spherical clusters; k known or can be estimated
🌍 Bangladesh Health Cluster: 64 districts clustered on 6 health indicators. K-means (k=3 chosen by elbow plot) identifies: Cluster 1 (12 districts, Dhaka-centred) = high healthcare access, low mortality; Cluster 2 (28 districts) = moderate on all indicators; Cluster 3 (24 districts, Char/haor areas) = low access, high child mortality, high malnutrition. This clustering directly informs resource allocation for the Ministry of Health — districts in Cluster 3 receive priority funding.
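The k-means algorithm steps (assign to nearest centroid, recompute centroids, repeat) can be sketched from scratch in numpy (two simulated well-separated blobs as illustrative stand-ins for district indicators; the blob locations and seed are assumptions):

```python
import numpy as np

rng = np.random.default_rng(12)
A = rng.normal([0, 0], 0.5, size=(50, 2))     # blob 1
B = rng.normal([5, 5], 0.5, size=(50, 2))     # blob 2
X = np.vstack([A, B])

def kmeans(X, k, iters=50):
    # Initialise centroids at k random data points (one of several schemes)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Step 1: assign each observation to its nearest centroid
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        # Step 2: update centroids as cluster means (keep old if empty)
        new = []
        for j in range(k):
            pts = X[labels == j]
            new.append(pts.mean(axis=0) if len(pts) else centroids[j])
        centroids = np.array(new)
    return labels, centroids

labels, centroids = kmeans(X, 2)
print(centroids)
```

With well-separated blobs the algorithm recovers the two true groups; in practice one would run it several times with different random starts, since the result depends on initialisation.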
· · ·
A5
Supervised Classification
Discriminant & Classification Analysis
😄 Discriminant Analysis Analogy: "Discriminant analysis is like training a sorting machine. You show it thousands of labelled patients ('has disease' / 'no disease') along with their test results. It learns the pattern of test results that best separates the groups. Then when a new patient arrives with only test results (no diagnosis), the machine classifies them. Fisher's Linear Discriminant is one of the oldest and most elegant classification algorithms — predating neural networks by 80+ years!"
🎯
What it is
Supervised Group Separation
Discriminant Analysis has TWO goals: (1) Description: find linear combinations of variables (discriminant functions) that best separate g known groups; (2) Classification: build a rule to assign future observations to one of the g groups. Unlike cluster analysis: group memberships are KNOWN for the training data.
⚙️
Fisher's LDA
Linear Discriminant Analysis
Find direction w that maximises between-group variance / within-group variance: w = Sₚ⁻¹(x̄₁ − x̄₂) for 2-group case
Classify new x to group 1 if: w'x ≥ midpoint(w'x̄₁, w'x̄₂)
Assumes equal covariance matrices Σ₁=Σ₂ → uses pooled Sₚ
For g>2: up to min(g−1, p) discriminant functions (p = number of variables)
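A minimal Python sketch of the two-group rule above, assuming p = 2 variables so the pooled covariance Sₚ can be inverted with the closed-form 2×2 formula (toy data, illustrative only):

```python
def fisher_lda(X1, X2):
    """Two-group Fisher rule: w = Sp^{-1}(xbar1 - xbar2), classify by midpoint."""
    n1, n2 = len(X1), len(X2)
    m1 = [sum(c) / n1 for c in zip(*X1)]
    m2 = [sum(c) / n2 for c in zip(*X2)]
    # pooled within-group covariance Sp (2x2), divisor n1 + n2 - 2
    Sp = [[0.0, 0.0], [0.0, 0.0]]
    for X, m in ((X1, m1), (X2, m2)):
        for x in X:
            d = [xi - mi for xi, mi in zip(x, m)]
            for i in range(2):
                for j in range(2):
                    Sp[i][j] += d[i] * d[j]
    Sp = [[v / (n1 + n2 - 2) for v in row] for row in Sp]
    det = Sp[0][0] * Sp[1][1] - Sp[0][1] * Sp[1][0]
    inv = [[Sp[1][1] / det, -Sp[0][1] / det],
           [-Sp[1][0] / det, Sp[0][0] / det]]
    diff = [a - b for a, b in zip(m1, m2)]
    w = [sum(inv[i][j] * diff[j] for j in range(2)) for i in range(2)]

    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x))

    mid = 0.5 * (score(m1) + score(m2))   # w'xbar1 >= mid always holds
    def classify(x):
        return 1 if score(x) >= mid else 2
    return classify

classify = fisher_lda([(1, 1), (2, 1), (1, 2), (2, 2)],
                      [(6, 6), (7, 6), (6, 7), (7, 7)])
```

New observations are then assigned by `classify((x1, x2))` — exactly the midpoint rule stated above.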
💡
Probabilistic Classification
Bayes Classification Rules
Linear discriminant rule (LDA): Equal Σ → linear boundary
🌍 Bangladesh Medical ApplicationClassifying TB patients into 3 treatment response groups (rapid/moderate/slow responder) based on 6 baseline clinical variables (age, BMI, sputum grade, haemoglobin, ESR, CD4 count). LDA builds two discriminant functions. Cross-validated APER = 18% (82% correctly classified). The discriminant scores of new patients can be computed from their baseline labs to predict treatment response category — guiding personalised treatment decisions before expensive sensitivity testing is complete.
STAT2201 · Sampling Distribution
x̄
STAT2201 · B.Sc. Statistics Year 2 · BRUR
Sampling Distribution
Sampling Distributions of Mean · Variance · Proportions · CLT · t · F · χ² Distributions · Estimation · Confidence Intervals
🎓 What is a Sampling Distribution?
"If you took your sample 10,000 times and computed the mean each time, what would the distribution of those means look like?" THAT is the sampling distribution — not the distribution of data, but the distribution of a statistic over repeated sampling. 😄 "The sampling distribution is the bridge between data and inference — without it, statistics would just be fancy arithmetic."
Population (N): All items of interest — fixed but usually unobservable
Parameter: μ, σ², π — numerical summaries of the population, FIXED but UNKNOWN
Almost never observe the whole population — too large, costly, or destructive
🔬
Sample & Statistics
What We Actually Observe
Sample (n): n observations drawn from the population
Statistic: x̄, s², p̂ — functions of the sample; RANDOM VARIABLE before sampling
The KEY insight: statistics vary sample to sample — this variation has a pattern = sampling distribution
💡
3 Different Distributions
Never Confuse These!
Population distribution: All individuals — shape could be anything
Sample distribution: Your n observations — approximates population
Sampling distribution: Distribution of the STATISTIC over repeated samples
😄 "Confusing these three is the #1 intro-stats mistake. The CLT applies to the THIRD one!"
⚠️
Standard Error
SE ≠ SD
SD: Variability of individual observations (fixed, doesn't shrink with n)
SE(x̄): Variability of the sample MEAN over repeated samples = σ/√n
SE shrinks as n increases — more data → more precise estimate of μ
Standard Error of x̄SE(x̄) = σ/√n (decreases with n — more data = more precise)
UnbiasednessE(x̄) = μ ; E(s²) = σ² (why we divide by n−1, not n)
· · ·
S2
Key Result
Sampling Distribution of the Mean
📊
Normal Population
Exact Result (Any n)
If X₁,…,Xₙ iid N(μ,σ²), then x̄ ~ N(μ, σ²/n) exactly for any n. Standardise: Z = (x̄−μ)/(σ/√n) ~ N(0,1). When σ unknown, replace with s → T = (x̄−μ)/(s/√n) ~ t(n−1).
💡
Effect of n
More Data = Narrower Distribution
Larger n → smaller SE = σ/√n → sampling distribution narrows around μ
Doubling n reduces SE by √2 ≈ 1.41 (not by 2 — diminishing returns!)
To halve the SE, you must QUADRUPLE n — sampling is expensive!
Let X₁,…,Xₙ be iid with mean μ and finite variance σ². Then as n→∞: √n(x̄−μ)/σ →_d N(0,1) regardless of the population distribution shape. For large n: x̄ ≈ N(μ, σ²/n). This is why the normal distribution appears everywhere!
💡
Why it's Magic
Any Population → Normal x̄
Population can be exponential, uniform, skewed, bimodal — doesn't matter!
n ≥ 30: CLT approximation usually good; n ≥ 50 for very skewed populations
Foundation for: t-tests, z-tests, ANOVA, regression inference, and almost everything
😄 "CLT: Statistics' superhero. No matter what messy distribution you throw at it — average enough and you get normal. Every time."
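The theorem is easy to verify by simulation — here averaging n = 30 draws from a heavily right-skewed Exponential(1) population (mean 1, SD 1; the seed and repetition count are arbitrary choices):

```python
import math
import random

random.seed(42)                      # reproducible run
n, reps = 30, 2000
# sample means of n Exponential(1) draws, repeated many times
means = [sum(random.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]

m = sum(means) / reps
sd = math.sqrt(sum((x - m) ** 2 for x in means) / (reps - 1))
print(round(m, 3), round(sd, 3))     # close to mu = 1 and sigma/sqrt(n) ~ 0.183
```

Despite the skewed population, a histogram of `means` looks bell-shaped, centred at μ = 1 with spread ≈ σ/√n — exactly the CLT claim.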
⚠️
When CLT Fails
Important Exceptions
Cauchy distribution: no finite mean or variance → CLT doesn't apply (the mean of n Cauchy observations is still Cauchy!)
Very small n with highly skewed data
Dependent observations: standard CLT requires independence
When the population is exactly normal, use the exact t/F/χ² results — no CLT approximation needed
Practical formx̄ ≈ N(μ, σ²/n) for n≥30 approximately
Sum versionSₙ = ΣXᵢ ≈ N(nμ, nσ²) for large n
· · ·
S4
Variance Inference
Chi-Square, t & F Distributions
📐
Chi-Square χ²(k)
Sum of Squared Normals
χ²(k) = Z₁²+…+Zₖ² where Zᵢ iid N(0,1)
Mean=k; Var=2k; Right-skewed; always ≥ 0
Sampling dist of variance: (n−1)s²/σ² ~ χ²(n−1)
⚠ Requires population normality — sensitive to departures!
🍺
t Distribution
Z ÷ √(χ²/ν) — The Guinness Distribution
t(ν) = Z/√(χ²(ν)/ν); heavier tails than N(0,1)
T = (x̄−μ)/(s/√n) ~ t(n−1) when sampling from N(μ,σ²)
As ν→∞: t(ν) → N(0,1)
😄 "Invented by Gosset at Guinness Brewery — published as 'Student' because Guinness prohibited employee publications. Cheers to small samples! 🍺"
💡
F Distribution
Ratio of Two Chi-Squares
F(k₁,k₂) = [χ²(k₁)/k₁] / [χ²(k₂)/k₂]
F = s₁²/s₂² ~ F(n₁−1, n₂−1) under H₀: σ₁²=σ₂² — the variance-ratio test
F = MSA/MSE in ANOVA; t²(ν) = F(1,ν)
Named for Ronald Fisher — inventor of ANOVA, p-values, and experimental design
χ² from sample variance(n−1)s²/σ² ~ χ²(n−1) (population normal)
CI for σ²[(n−1)s²/χ²_{α/2}, (n−1)s²/χ²_{1−α/2}]
Two-sample t (equal σ)T = (x̄₁−x̄₂)/(sₚ√(1/n₁+1/n₂)) ~ t(n₁+n₂−2)
CI for μ (σ unknown)x̄ ± t_{α/2,n−1} · s/√n
Sample size for μn = (z_{α/2} · σ / E)² (E = desired margin of error)
🌍 Bangladesh ExampleA nutritionist samples 40 children aged 5–10 from Rangpur to estimate mean height. Sample mean = 112 cm, s = 8.4 cm. 95% CI: 112 ± t_{0.025,39} × 8.4/√40 = 112 ± 2.023 × 1.33 = [109.3, 114.7] cm. We are 95% confident the true population mean height is between 109.3 and 114.7 cm. To halve the margin of error, we would need n = 4×40 = 160 children — quadrupling the sample!
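The arithmetic of that interval as a quick Python check (the critical value t₀.₀₂₅,₃₉ ≈ 2.023 is hard-coded from t-tables, since the standard library has no t quantile function):

```python
import math

xbar, s, n = 112.0, 8.4, 40
t_crit = 2.023                       # t_{0.025, 39} taken from tables
se = s / math.sqrt(n)                # standard error of the mean
margin = t_crit * se
lo, hi = xbar - margin, xbar + margin
print(round(lo, 1), round(hi, 1))    # 109.3 114.7 — matches the example
```

Note also the cost of precision: halving `margin` requires quadrupling `n`, since SE shrinks only as 1/√n.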
· · ·
S5
Proportions
Sampling Distribution of Proportions & Estimation
📈
Sample Proportion
For Binary Outcomes
p̂ = X/n where X~Binomial(n,p). E(p̂)=p (unbiased); Var(p̂)=p(1−p)/n. By CLT: p̂ ≈ N(p, p(1−p)/n) when np≥10 AND n(1−p)≥10. Standard error: SE(p̂) = √[p(1−p)/n].
💡
CI Interpretation
What 95% CI Really Means
A 95% CI: if we repeated the sampling many times and computed a CI each time, about 95% of those intervals would contain the true μ. It does NOT mean "95% probability μ is in this specific interval" — μ is fixed! 😄 "The CI is a fishing net — 95% of the time it catches the fish (μ). Once cast, the fish is either inside or not."
p̂ approx. dist.p̂ ≈ N(p, p(1−p)/n) for large n (np≥10 AND n(1−p)≥10)
95% CI for pp̂ ± 1.96 · √[p̂(1−p̂)/n] (Wald interval)
Sample size for pn = z²_{α/2} · p(1−p) / E² (use p=0.5 if unknown)
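Both formulas as a Python sketch — the Wald interval, and the conservative sample-size rule with p = 0.5 (the input numbers are illustrative):

```python
import math

def wald_ci(p_hat, n, z=1.96):
    """Wald interval: p_hat +/- z * sqrt(p_hat(1-p_hat)/n)."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

def sample_size_p(E, p=0.5, z=1.96):
    """n = z^2 p(1-p) / E^2, rounded UP so the margin E is guaranteed."""
    return math.ceil(z * z * p * (1 - p) / (E * E))

lo, hi = wald_ci(0.40, 100)
print(round(lo, 3), round(hi, 3))   # 0.304 0.496
print(sample_size_p(0.03))          # 1068
```

The second result is the familiar "about a thousand respondents" rule behind national opinion polls with a ±3% margin.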
STAT2203 · Analysis of Variance & Design of Experiment
🎓 ANOVA in one sentence
ANOVA tests whether means of 3+ groups differ — by comparing BETWEEN-group variance to WITHIN-group variance. Why not just do many t-tests? With g groups you'd need C(g,2) t-tests, inflating Type I error massively. ANOVA controls this with ONE test. 😄 "ANOVA: Statistics' way of comparing all your groups at once, without letting false alarms pile up." Fisher's golden rule of DOE: "Block what you can; randomise what you cannot."
H₀: μ₁=μ₂=…=μg vs H₁: at least one μᵢ differs. Partitions total variation: SST = SSA + SSE. If between-group (MSA) >> within-group (MSE), groups differ. F = MSA/MSE ~ F(g−1, N−g) under H₀.
⚙️
The Logic
Why Variance Tests Means
MSA (between): Measures group-mean differences — large if μᵢ differ
MSE (within): Measures random error — unaffected by group differences
Under H₀: both estimate σ² → F≈1. Under H₁: MSA >> MSE → F >> 1
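The partition SST = SSA + SSE and the F ratio can be computed by hand in a few lines (toy data; a minimal sketch of the one-way calculation):

```python
def one_way_anova(groups):
    """Return (SSA, SSE, F) for a list of groups of observations."""
    N = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / N
    means = [sum(g) / len(g) for g in groups]
    ssa = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    sse = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_a, df_e = len(groups) - 1, N - len(groups)
    f = (ssa / df_a) / (sse / df_e)          # F = MSA / MSE
    return ssa, sse, f

ssa, sse, f = one_way_anova([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
print(ssa, sse, round(f, 1))   # 42.0 6.0 21.0 — compare with an F(2, 6) critical value
```

Here the third group's mean is far from the others, so MSA (21) dwarfs MSE (1) and F is huge — the "between >> within" signature described above.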
💡
Effect Size
η² — How Meaningful Is the Effect?
η² = SSA/SST: proportion of variance explained. Benchmarks: small=0.01, medium=0.06, large=0.14. ALWAYS report — significant F with tiny η² means real but trivially small difference! 😄 "Statistical significance ≠ practical importance."
⚠️
What ANOVA Doesn't Tell
Which Groups Differ?
Significant F only says "at least one mean differs" — need post-hoc tests to find WHICH pairs differ. Never claim the group with highest mean is significantly different without a post-hoc test — that's data dredging!
Bonferroni adjusted αα* = α/m (m = number of comparisons)
· · ·
V3
Two Factors & Interaction
Two-Way ANOVA & Interaction Effect
📐
Two-Way ANOVA
Model & Decomposition
Tests: (1) Main effect of A; (2) Main effect of B; (3) Interaction A×B. SST = SSA + SSB + SSAB + SSE. Interaction is most interesting — does the effect of A depend on the level of B? Plot interaction plots: parallel lines = no interaction; crossing lines = interaction present.
💡
Interaction
It Depends! — The Most Important Result
Significant interaction means the effect of fertiliser on yield DEPENDS on which crop variety is used. Cannot interpret main effects in isolation when interaction is significant. 😄 "Interaction is statistics saying: 'it depends' — and that's almost always the most scientifically interesting answer."
Two-way modelYᵢⱼₖ = μ + αᵢ + βⱼ + (αβ)ᵢⱼ + εᵢⱼₖ
SS decompositionSST = SSA + SSB + SSAB + SSE
F for interactionF_{AB} = MSAB/MSE ~ F((a−1)(b−1), ab(n−1))
· · ·
V4
Experimental Designs
CRD · RBD · LSD & Factorial Designs
🎲
CRD
Completely Randomised Design
Treatments randomly assigned to all units with no restrictions. Simplest design — use when units are homogeneous. Analysis: one-way ANOVA. df_error = N−t. Disadvantage: if units are heterogeneous, MSE will be large and F-test will be weak.
🧱
RBD
Randomised Block Design
Group similar units into blocks; randomise treatments within blocks. Removes block variation from error → smaller MSE → more powerful F-test. Fisher's golden rule: "Block what you can, randomise what you cannot." df_error = (t−1)(b−1). Widely used in agricultural, medical, and industrial experiments.
🔲
LSD
Latin Square Design — Two-Way Blocking
Controls TWO nuisance variables (rows and columns) simultaneously. A t×t square where each treatment appears exactly once in each row and column. df_error = (t−1)(t−2). Assumes additivity — no interactions among rows, columns, and treatments.
🔢
2ᵏ Factorial
k Factors at 2 Levels Each
All 2ᵏ combinations of k factors (each at low/high)
Estimates all main effects AND all interactions
Fractional 2^{k-p}: half/quarter fractions to reduce runs
Yates algorithm computes all effects efficiently
😄 "The 2ᵏ design: maximum information, minimum runs — the statistician's favourite meal."
2² main effect AEffect A = [(y_a+y_ab) − (y_(1)+y_b)] / 2n
🌍 Bangladesh Agricultural TrialTesting 4 fertiliser treatments (t=4) on rice in 3 blocks (b=3) of similar soil fertility. RBD gives df_error=(4−1)(3−1)=6. Result: F=8.4 (p=0.014), η²=0.62 — treatments explain 62% of variance. Tukey post-hoc: Treatment D significantly outperforms A and B (p<0.05) but not C (p=0.12). Blocking removed soil-fertility variability, making the test sensitive enough to detect real treatment differences that a CRD might have missed.
STAT3201 · Hypothesis Testing
H₀
STAT3201 · B.Sc. Statistics Year 3 · BRUR
Hypothesis Testing
Neyman-Pearson Framework · Type I & II Errors · Power · MP & UMP Tests · Likelihood Ratio Tests · p-values · Non-Parametric Tests
🎓 The Court of Statistics
We assume H₀ is true (innocent until proven guilty) and ask: how surprising is our data if H₀ were true? If very surprising (small p-value), we reject H₀. 😄 "H₀ is like a stubborn professor — it won't budge unless the evidence is overwhelming. And even then, there's a chance you made a mistake (Type I error)." Key texts: Casella & Berger for theory; Lehmann & Romano for advanced testing.
H₀: Status quo / no effect — assumed true by default
H₁: What we're trying to demonstrate
Simple: Completely specifies distribution (μ=5)
Composite: Specifies a class (μ>5)
One vs two-sided: H₁: μ>μ₀ vs H₁: μ≠μ₀
⚖️
Error Types
Four Outcomes
✅ H₀ true, Don't reject: Correct (prob 1−α)
❌ H₀ true, Reject: Type I error α — false alarm
❌ H₀ false, Don't reject: Type II error β — missed detection
✅ H₀ false, Reject: Power = 1−β — correct detection
💡
The Tradeoff
α↓ → β↑ for Fixed n
Decreasing α (fewer false alarms) increases β (more misses) for fixed n. Only way to reduce both: increase n. Power = 1−β should be ≥ 0.80 in well-designed studies. 😄 "Demanding 99.9% confidence with n=5 is like demanding perfect night vision in complete darkness — physically impossible with so little data!"
⚠️
Key Asymmetry
H₀ and H₁ Are Not Equal
We control α directly. β depends on α, n, and the true parameter. We can NEVER "prove H₀" — only fail to reject it. "Not guilty ≠ innocent. Fail to reject H₀ ≠ H₀ is true."
Type I error αP(reject H₀ | H₀ true) — false positive; set before the test
Type II error βP(fail to reject H₀ | H₁ true) — false negative; depends on n, δ, σ
Power1 − β = P(reject H₀ | H₁ true) — ability to detect a real effect
Sample size (z-test)n = σ²(z_α + z_β)² / (μ₁−μ₀)²
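The sample-size formula as a Python sketch. The z values are hard-coded from normal tables, and one-sided α = 0.05 with power 0.80 are assumed as common defaults:

```python
import math

def n_for_mean(sigma, delta, z_alpha=1.645, z_beta=0.8416):
    """n = sigma^2 (z_alpha + z_beta)^2 / delta^2, rounded up.
    Defaults: one-sided alpha = 0.05, power = 0.80 (z values from tables)."""
    return math.ceil(sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2)

# detect a 5-point shift on a scale with sigma = 15
print(n_for_mean(15, 5))   # 56
```

This makes the α–β tradeoff concrete: demanding z_α = 2.576 (α = 0.005) instead of 1.645 pushes the required n up sharply for the same power.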
· · ·
H2
Optimal Tests
Neyman-Pearson Lemma · UMP Tests & LRT
🏆
N-P Lemma
Most Powerful Test for Simple H
For H₀:θ=θ₀ vs H₁:θ=θ₁ (both simple), the Most Powerful (MP) test at level α rejects H₀ when Λ(x) = L(θ₁)/L(θ₀) > k. The N-P Lemma derives the optimal rejection region from the likelihood ratio — no guessing needed. For Gaussian data, this recovers the z-test as optimal.
🎯
UMP & MLR
Composite Alternatives
UMP test: Most powerful test for EVERY θ∈H₁ — exists for one-sided hypotheses in exponential families
MLR (Monotone Likelihood Ratio): If L(θ₁)/L(θ₀) is monotone in some statistic T(x), then rejecting for large T gives the UMP test of H₀:θ≤θ₀ vs H₁:θ>θ₀
Normal, Poisson, Binomial — all have MLR in their natural parameter
💡
LRT — General Tests
Wilks' Theorem
LRT: Λ = L(θ̂₀)/L(θ̂) ∈ [0,1]. Wilks (1938): −2 ln Λ → χ²(r) under H₀ where r = number of restrictions. This makes LRT applicable to ANY hypothesis. The chi-square test of independence is a special case. Reject H₀ if −2 ln Λ > χ²_α(r).
N-P MP testReject H₀ if L(θ₁;x)/L(θ₀;x) > k (k: size-α critical value)
p-value = P(T ≥ t_obs | H₀) = probability of data as extreme or more extreme than observed, assuming H₀ true. Small p → data surprising under H₀ → evidence against H₀. It is a continuous measure of evidence, NOT a binary pass/fail.
⚠️
What p is NOT
5 Common Misconceptions
❌ "P(H₀ is true)" — H₀ has no probability in frequentist stats
❌ "Probability results occurred by chance"
❌ "Probability results will replicate"
❌ Measures effect size — huge n can make trivial effects "significant"
✅ "How surprising is my data if H₀ were true?"
💡
Common Tests
Parametric Quick Reference
One-sample z: Z = (x̄−μ₀)/(σ/√n) ~ N(0,1)
One-sample t: T = (x̄−μ₀)/(s/√n) ~ t(n−1)
Paired t: T = d̄/(sD/√n) ~ t(n−1)
χ² GOF: χ² = Σ(O−E)²/E ~ χ²(k−1−p) (p = number of estimated parameters)
χ² independence: ~ χ²((r−1)(c−1))
📊
Non-Parametric Tests
When Assumptions Fail
Wilcoxon signed-rank: Non-parametric one-sample/paired t
Mann-Whitney U: Non-parametric two-sample t (ranks)
Kruskal-Wallis: Non-parametric one-way ANOVA
Spearman's ρ: Non-parametric correlation
⚠ Less powerful than parametric when assumptions hold — use as backup
p-value (two-sided)p = 2·P(T ≥ |t_obs| | H₀)
Decision ruleReject H₀ iff p < α (set α before the test!)
Mann-Whitney UU = n₁n₂ + n₁(n₁+1)/2 − R₁ (R₁ = rank sum of group 1)
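The U statistic from the quick-reference line, computed with midranks for tied values (a minimal sketch; real software also supplies the normal approximation for the p-value):

```python
def mann_whitney_u(g1, g2):
    """U = n1*n2 + n1(n1+1)/2 - R1, using midranks for ties."""
    combined = sorted(g1 + g2)

    def rank(v):
        # midrank = average of the 1-based positions the value occupies
        lo = combined.index(v) + 1
        hi = lo + combined.count(v) - 1
        return (lo + hi) / 2

    n1, n2 = len(g1), len(g2)
    r1 = sum(rank(v) for v in g1)            # rank sum of group 1
    return n1 * n2 + n1 * (n1 + 1) / 2 - r1

print(mann_whitney_u([1, 3, 5], [2, 4, 6]))  # 6.0
```

Because the test uses only ranks, it is unaffected by outliers and skew — the robustness that makes it the backup for the two-sample t.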
😄 The p-hacking Warning"If you torture your data long enough, it will confess to anything." — Ronald Coase. Running 20 tests and reporting only the p<0.05 result guarantees a false positive. Pre-register your hypotheses before seeing the data, report ALL analyses, and always report effect sizes alongside p-values. The replication crisis in psychology was largely caused by widespread p-hacking and selective reporting. Register your analysis plan first — commit before you look!
🎓 Why Sampling?
"You don't need to eat the whole pot of soup to know if it's salty — one spoonful is enough, IF it's well stirred." That's sampling. 😄 The goal: make valid inferences about a population of N units by examining only n << N units, saving time, cost, and resources while maintaining accuracy.
Sampling frame: List of all N population units — must be complete and up-to-date
Sampling unit: The unit selected at each draw
Inclusion probability πᵢ: Probability unit i is selected
Design effect (DEFF): Ratio of actual variance to SRS variance
💡
Probability vs Non-Probability
Two Types of Sampling
Probability: Every unit has known, non-zero inclusion probability → valid inference possible. SRS, stratified, cluster, systematic.
Non-probability: Convenience, purposive, quota — no valid inference to population. Use only for exploratory work.
⚠️
Key Principle
Unbiasedness & Efficiency
Unbiased estimator: E(ȳ) = Ȳ on average
Efficiency: Smaller variance = more information per unit cost
Goal: choose design that minimises variance for given cost
· · ·
T2
Baseline Design
Simple Random Sampling (SRS)
🎲
SRSWOR vs SRSWR
With vs Without Replacement
SRSWOR: Each unit selected at most once — more common; smaller variance
SRSWR: Units can repeat — simpler theory; larger variance
Finite population correction (FPC) = (1−f) = (1−n/N) — matters when n/N > 0.05
⚙️
Estimation
Mean, Total, Proportion
ȳ = (1/n)Σyᵢ — unbiased estimator of Ȳ
ŷ_total = Nȳ — unbiased estimator of total Y
p̂ = x/n — unbiased estimator of proportion P
Var(ȳ) — SRSWORV(ȳ) = (1−f)·S²/n where f=n/N, S²=Σ(yᵢ−Ȳ)²/(N−1)
Estimated Varv(ȳ) = (1−f)·s²/n where s²=Σ(yᵢ−ȳ)²/(n−1)
95% CI for Ȳȳ ± 1.96·√v(ȳ)
Sample size nn = N·z²S² / (N·e² + z²S²) (e=desired margin of error)
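The SRSWOR estimator chain above (mean, FPC-corrected variance, 95% CI) as a Python sketch with toy numbers:

```python
import math

def srs_estimate(y, N, z=1.96):
    """Return (ybar, v_ybar, ci) for an SRSWOR sample y from a population of size N."""
    n = len(y)
    ybar = sum(y) / n
    s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)   # sample variance
    f = n / N                                        # sampling fraction
    v_ybar = (1 - f) * s2 / n                        # FPC-corrected variance
    half = z * math.sqrt(v_ybar)
    return ybar, v_ybar, (ybar - half, ybar + half)

ybar, v, ci = srs_estimate([2, 4, 6, 8, 10], N=100)
print(ybar, round(v, 2))   # 6.0 1.9
```

With f = 0.05 the FPC trims 5% off the variance; when n/N is tiny the (1−f) factor is negligible and the formula collapses to the familiar s²/n.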
· · ·
T3
Improved Efficiency
Stratified Random Sampling
🗂️
What it is
Divide & Sample
Divide population into L non-overlapping strata; take SRS within each stratum. Why? Reduces variance by removing between-stratum variation from the error. Always more efficient than SRS if strata are internally homogeneous.
⚙️
Allocation Methods
How Many from Each Stratum?
Proportional: nₕ = n·(Nₕ/N) — simple; good when σₕ similar
Optimal (Neyman): nₕ ∝ Nₕσₕ — minimises variance for fixed n
Cost-optimal: nₕ ∝ Nₕσₕ/√cₕ — accounts for variable cost per stratum
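The allocation rules differ only in their weights; proportional and Neyman allocation as a sketch (strata sizes and SDs are illustrative, and the rounded nₕ may need a small adjustment to sum exactly to n):

```python
def proportional_alloc(n, N_h):
    """n_h = n * N_h / N — sample shares mirror population shares."""
    N = sum(N_h)
    return [round(n * Nh / N) for Nh in N_h]

def neyman_alloc(n, N_h, sigma_h):
    """n_h proportional to N_h * sigma_h — minimum variance for fixed n."""
    w = [Nh * sh for Nh, sh in zip(N_h, sigma_h)]
    return [round(n * wi / sum(w)) for wi in w]

print(proportional_alloc(100, [400, 600]))        # [40, 60]
print(neyman_alloc(100, [400, 600], [10, 20]))    # [25, 75]
```

Neyman shifts effort towards the large, variable stratum — exactly the HIES logic below of sampling Dhaka more heavily than homogeneous Sylhet.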
💡
When to Stratify
Good Stratification Criteria
Variable highly correlated with study variable Y
Administrative convenience (districts, regions, age groups)
Need separate estimates for subgroups (domains)
Oversampling rare subgroups for adequate representation
🌍 Bangladesh HIES ExampleHousehold Income & Expenditure Survey stratifies by division (8) × urban/rural (2) = 16 strata. Neyman allocation samples more from Dhaka (large, variable) and less from Sylhet (small, homogeneous). Result: 40% lower variance than SRS of same total size — more accurate poverty estimates at lower cost.
· · ·
T4
Practical Designs
Systematic & Cluster Sampling
📋
Systematic Sampling
Every kth Unit
k = N/n (sampling interval); select random start r ∈ {1,…,k}; then r, r+k, r+2k, …
Very easy to implement — just a list and arithmetic
Efficient when list is in random order (≈SRS)
⚠ Periodic pattern in list + periodic k = biased disaster!
🏘️
Cluster Sampling
Sample Groups, Not Individuals
Divide population into clusters; randomly select m clusters; survey ALL units in selected clusters
Cost-efficient when clusters are geographically compact
Less efficient statistically — units within cluster tend to be similar (intraclass correlation ρ)
DEFF = 1 + (b̄−1)ρ where b̄ = avg cluster size
💡
Two-Stage Cluster
Select Clusters, Then Sub-Sample
Stage 1: Select m PSUs (primary sampling units) with probability proportional to size. Stage 2: Select n SSUs within each selected PSU. Used in virtually all large national surveys (DHS, MICS, census post-enumeration). More flexible than single-stage cluster sampling.
Systematic interval kk = N/n (round to integer); sample: r, r+k, r+2k, …
Cluster mean estimatorȳ_cl = (1/m)Σᵢȳᵢ (ȳᵢ = mean of ith selected cluster)
If auxiliary variable X (known population mean X̄) is highly correlated with Y: ȳ_R = R̂·X̄ where R̂=ȳ/x̄. Biased but often much lower MSE than ȳ. Best when the ratio Y/X is more nearly constant than Y itself — e.g., estimating crop yield per hectare.
⚙️
Regression Estimator
OLS-Based Improvement
ȳ_reg = ȳ + b̂(X̄−x̄) where b̂ = Σ(xᵢ−x̄)(yᵢ−ȳ)/Σ(xᵢ−x̄)². Always has smaller or equal variance than ȳ. More general than ratio estimator — doesn't require proportionality. Gain in efficiency ∝ ρ²(X,Y).
💡
When to Use Each
Ratio vs Regression vs SRS
Use ratio when Y∝X (passes through origin) and ρ>0.5
Select PSUs with probability proportional to a size measure (number of households, land area). Larger clusters have higher selection probability. Combined with equal-probability sub-sampling within PSUs → self-weighting sample. Used in almost all national surveys.
⚠️
Non-Sampling Errors
Often Bigger Than Sampling Error!
Coverage error: Frame misses units (undercoverage of homeless, migrants)
Non-response: Selected units don't participate — can cause serious bias
Measurement error: Wrong answers due to question wording, recall, interviewer bias
Processing error: Data entry, coding mistakes
😄 "A perfectly designed sample with 40% non-response is worse than a simple convenience sample for many questions."
💡
Hansen-Hurwitz Estimator
PPS with Replacement
Per-draw selection probability pᵢ = Mᵢ/M₀. Estimator of the population total: Ŷ_HH = (1/n)Σ(yᵢ/pᵢ). Unbiased. Variance ∝ variation of yᵢ/pᵢ — good PPS reduces this variation dramatically compared to SRS for skewed populations (like business surveys).
PPS prob. of selectionπᵢ = n·Mᵢ / M₀ (Mᵢ=size of unit i, M₀=total size)
HH estimatorŶ_HH = (1/n)·Σᵢ(yᵢ/pᵢ), pᵢ = Mᵢ/M₀ per draw (unbiased for the total Y)
Horvitz-Thompsonŷ_HT = Σᵢ∈s (yᵢ/πᵢ) (unbiased for any design)
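A sketch of the Hansen-Hurwitz estimator under PPS with replacement. When y is roughly proportional to size, yᵢ/pᵢ is nearly constant, so the estimate of the total barely varies between samples (toy numbers assumed; pᵢ = Mᵢ/M₀ is the per-draw probability of each sampled unit):

```python
def hansen_hurwitz(sample_y, sample_p):
    """Unbiased estimator of the population TOTAL under PPS with replacement.
    sample_p[i] = per-draw selection probability M_i / M_0 of the drawn unit."""
    n = len(sample_y)
    return sum(y / p for y, p in zip(sample_y, sample_p)) / n

# toy population where y is exactly proportional to size, true total Y = 100;
# the sample happened to draw units with (y, p) = (10, 0.1) and (40, 0.4)
print(hansen_hurwitz([10, 40], [0.1, 0.4]))   # 100.0 — the zero-variance ideal case
```

With SRS, drawing the small unit (y = 10) versus the large one (y = 40) would swing the estimate wildly; PPS weighting cancels that swing.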
🎓 Why Categorical Data Analysis?
Most real-world outcomes are categorical — disease/no disease, vote/don't vote, pass/fail. You cannot use t-tests or ANOVA on counts. CDA provides the correct tools: chi-square tests for independence, odds ratios for effect size, logistic regression for prediction, and log-linear models for multi-way tables. As Agresti notes: "Categorical data analysis is arguably more important in practice than normal-theory methods."
MLE of ππ̂ = y/n (sample proportion — unbiased, consistent)
· · ·
C2
Core Tool
Contingency Tables & χ² Tests
📊
r×c Contingency Table
Cross-Tabulation
An r×c table cross-classifies n observations by two categorical variables (r rows, c columns). Cell count nᵢⱼ = observations in row i, column j. Marginal totals: nᵢ₊ (row), n₊ⱼ (column). Test: are the two variables independent?
⚙️
Pearson χ² Test
Testing Independence
H₀: rows and columns are independent (πᵢⱼ = πᵢ₊·π₊ⱼ)
Expected count: Eᵢⱼ = nᵢ₊·n₊ⱼ/n (under H₀)
χ² = Σ(nᵢⱼ−Eᵢⱼ)²/Eᵢⱼ ~ χ²((r−1)(c−1)) under H₀
⚠ Requires Eᵢⱼ ≥ 5 in all cells — use Fisher's exact if violated
💡
Likelihood Ratio G²
Alternative to χ²
G² = 2Σnᵢⱼ·ln(nᵢⱼ/Eᵢⱼ) ~ χ²((r−1)(c−1)). Also called the deviance. Preferred in log-linear model context — additive across hierarchical models. χ² and G² converge for large n; differ for small n.
⚠️
Fisher's Exact Test
Small Samples
For 2×2 tables with small expected counts: compute exact probability of observing table this extreme, conditioning on both margins fixed. p = C(n₁₊,n₁₁)·C(n₂₊,n₂₁)/C(n,n₊₁). No large-sample approximation needed.
Expected cell countEᵢⱼ = nᵢ₊·n₊ⱼ / n (under independence)
Pearson χ²X² = Σᵢⱼ(nᵢⱼ−Eᵢⱼ)²/Eᵢⱼ ~ χ²((r−1)(c−1))
Likelihood ratio G²G² = 2Σᵢⱼ nᵢⱼ·ln(nᵢⱼ/Eᵢⱼ) ~ χ²((r−1)(c−1))
dfdf = (r−1)(c−1) (for r×c independence test)
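The X², G², and df formulas above, computed for an r×c table in Python (toy 2×2 table; the `if o > 0` guard handles empty cells, whose contribution to G² is zero by convention):

```python
import math

def chi_square_tests(table):
    """Return (X2, G2, df) for an r x c contingency table given as a list of rows."""
    r, c = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(r)) for j in range(c)]
    x2 = g2 = 0.0
    for i in range(r):
        for j in range(c):
            e = row_tot[i] * col_tot[j] / n      # expected count under independence
            o = table[i][j]
            x2 += (o - e) ** 2 / e               # Pearson contribution
            if o > 0:
                g2 += 2 * o * math.log(o / e)    # likelihood-ratio contribution
    return x2, g2, (r - 1) * (c - 1)

x2, g2, df = chi_square_tests([[20, 30], [30, 20]])
print(round(x2, 2), round(g2, 2), df)   # 4.0 4.03 1
```

With df = 1 and χ²₀.₀₅(1) = 3.84, both statistics just cross the threshold — and, as noted above, X² and G² agree closely at this sample size.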
· · ·
C3
Effect Size
Measures of Association — OR, RR & φ
📐
Odds Ratio (OR)
Most Important Association Measure
OR = (n₁₁·n₂₂)/(n₁₂·n₂₁) = (odds of outcome in group 1)/(odds in group 2). OR=1 means no association. OR>1 means higher odds in group 1. OR is the natural parameter for logistic regression and case-control studies. Does not depend on marginal totals — unlike RR.
⚙️
Relative Risk (RR)
Risk Ratio for Prospective Studies
RR = (n₁₁/n₁₊) / (n₂₁/n₂₊) = risk in exposed / risk in unexposed
More intuitive than OR when outcomes are common
Only valid when row totals are fixed (prospective/cohort design)
For rare outcomes: OR ≈ RR
💡
φ and Cramér's V
Symmetric Association Measures
φ = √(χ²/n) — for 2×2 tables; ∈ [0,1] (the signed version, (n₁₁n₂₂−n₁₂n₂₁)/√(product of the four margins), lies in [−1,1])
Cramér's V = √(χ²/(n·min(r−1,c−1))) — for r×c; ∈ [0,1]
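All three measures for a 2×2 table [[a, b], [c, d]] — rows = groups, columns = outcome yes/no — as a Python sketch with toy counts:

```python
import math

def association_2x2(a, b, c, d):
    """Odds ratio, relative risk, and signed phi for a 2x2 table [[a, b], [c, d]]."""
    odds_ratio = (a * d) / (b * c)
    rel_risk = (a / (a + b)) / (c / (c + d))     # valid for cohort-style rows
    phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return odds_ratio, rel_risk, phi

or_, rr, phi = association_2x2(30, 70, 10, 90)
print(round(or_, 2), round(rr, 2), round(phi, 3))   # 3.86 3.0 0.25
```

Note OR (3.86) exceeds RR (3.0) here because the outcome is fairly common (30%); for rare outcomes the two converge, as the card above states.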
🌍 Bangladesh TB Study2×2 table: smokers vs non-smokers, TB vs no TB. OR=3.2 (95% CI: 1.8–5.7, p<0.001). Interpretation: smokers have 3.2 times the odds of TB compared to non-smokers. Since TB is rare (<5%), OR ≈ RR: smokers have approximately 3× the risk. This is statistically significant AND clinically meaningful — OR=3.2 is a strong association.
· · ·
C4
Binary Outcomes
Logistic Regression
🔢
The Model
Logit Link Function
For binary Y∈{0,1}: log[π/(1−π)] = β₀ + β₁X₁ + … + βₖXₖ where π = P(Y=1|X). The logit link ensures predicted probabilities ∈ (0,1). Estimated by maximum likelihood (via the Iteratively Reweighted Least Squares, IRLS, algorithm), not OLS.
⚙️
Interpretation
Coefficients as Log-Odds
βⱼ = change in log-odds of Y=1 per unit increase in Xⱼ (others fixed)
exp(βⱼ) = odds ratio for 1-unit increase in Xⱼ — most interpretable
Odds RatioOR_j = exp(β̂ⱼ) — per unit increase in Xⱼ holding others fixed
Wald testz = β̂ⱼ/SE(β̂ⱼ) ~ N(0,1) (test H₀: βⱼ=0)
LR test (nested)G² = −2[ℓ(reduced) − ℓ(full)] ~ χ²(df_diff)
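A minimal Newton-Raphson fit for one binary predictor, showing that exp(β̂₁) reproduces the 2×2 odds ratio. The toy data have success probability 1/4 at x=0 and 3/4 at x=1, so the true OR is (3/1)/(1/3) = 9 (a sketch of the fitting idea, not how production IRLS handles many predictors):

```python
import math

def logistic_fit(x, y, iters=25):
    """Newton-Raphson MLE for logit(pi) = b0 + b1*x (one predictor)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0   # score vector and information matrix
        for xi, yi in zip(x, y):
            p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
            g0 += yi - p
            g1 += (yi - p) * xi
            w = p * (1 - p)               # the IRLS weight
            h00 += w; h01 += w * xi; h11 += w * xi * xi
        det = h00 * h11 - h01 * h01       # invert the 2x2 information matrix
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = logistic_fit(x, y)
print(round(math.exp(b1), 2))   # 9.0 — exp(beta1) equals the odds ratio
```

This is exactly the interpretation rule above: β̂₁ is the log-odds difference between x=1 and x=0, so exponentiating recovers the OR.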
· · ·
C5
Multi-Way Tables
Log-Linear Models
📦
What it is
Modelling Cell Counts
Log-linear models treat cell counts as Poisson: ln(μᵢⱼ) = λ + λᵢᴬ + λⱼᴮ + λᵢⱼᴬᴮ. All variables are response variables — no distinction between X and Y. Especially useful for 3+ way tables to model partial and conditional independence structures.
⚙️
Model Hierarchy
Saturated vs Parsimonious
Saturated: All interactions included; perfect fit; df=0 — useless for testing
[AB,AC,BC]: All 2-way interactions; no 3-way
[AB,C]: A and B interact; C independent of both
[A,B,C]: Complete independence of A, B, C
Select model by G² (deviance) and AIC
💡
Link to Logistic Regression
Equivalence Result
For a 2×J table (binary Y), the log-linear model [XY, X] is exactly equivalent to logistic regression of Y on X. The association parameter in the log-linear model = the logistic regression coefficient. This provides a unified framework for all categorical models.
n subjects measured twice (before/after) or matched pairs
Only discordant pairs (b and c) carry information about change
McNemar's test: χ² = (b−c)²/(b+c) ~ χ²(1)
Odds ratio for matched pairs: OR = b/c
💡
Cochran-Mantel-Haenszel
Controlling for Confounding
CMH test: test association between X and Y controlling for a third variable Z (stratification). Combines evidence across K strata. Common OR estimate: OR_MH = Σₖ(aₖdₖ/nₖ) / Σₖ(bₖcₖ/nₖ). Essential for removing confounding in observational studies.
Research Design · Literature Review · Measurement · Questionnaire Design · Validity & Reliability · Data Collection · Report Writing · Ethics
🎓 What is Research Methodology?
Research methodology is the systematic framework for conducting scientific inquiry — it answers "HOW do we find out what we want to know?" It covers study design, measurement, data collection, analysis strategy, and reporting. As Saunders et al. describe it: "Research methodology is the theory of how research should be undertaken." 😄 "Good methodology won't save bad ideas, but bad methodology will ruin good ones."
Research is a systematic, controlled, empirical investigation of natural phenomena guided by theory and hypotheses about the relationship between variables. It is not just "searching the web" — it requires rigour, replicability, and transparency.
⚙️
Types by Purpose
Basic vs Applied vs Action
Basic/Pure: Advances knowledge without immediate application — testing theory
Applied: Solves specific practical problems — policy evaluation, product testing
Action research: Researcher is also a participant; improves practice while studying it
💡
Types by Approach
Quantitative vs Qualitative vs Mixed
Quantitative: Numbers, tests, generalisation — large n, structured data
Qualitative: Meaning, context, depth — interviews, observation, small n
Mixed methods: Combines both — sequential, concurrent, or embedded designs
✅
Types by Time
Cross-Sectional vs Longitudinal
Cross-sectional: One point in time — snapshot; cheap but no causation
Longitudinal: Same subjects over time — tracks change; expensive but causal insight
Retrospective: Past data — case-control; recall bias risk
Prospective: Follow forward — cohort; gold standard for temporal causation
· · ·
R2
Study Design
Research Design & Paradigms
🔭
Research Paradigms
Positivism, Interpretivism & Pragmatism
Positivism: Objective reality exists; can be measured; deductive; quantitative
Interpretivism: Reality is socially constructed; context matters; inductive; qualitative
Most statistics students work within a positivist paradigm
⚙️
Experimental Design
RCT — Gold Standard
Randomised Controlled Trial (RCT): Random assignment to treatment/control → allows causal inference
Quasi-experiment: No randomisation but comparison group exists (DID, RDD)
Observational: No manipulation — correlation only (unless IV, matching used)
💡
Causal Inference
Why RCTs Rule
RCT removes selection bias — treatment and control groups are identical on average (observed AND unobserved). Average Treatment Effect (ATE) = E[Y(1)−Y(0)]. Without randomisation, Y(1) and Y(0) differ systematically — we observe only one potential outcome per person (fundamental problem of causal inference).
🌍 Bangladesh Microfinance RCTBandhan microfinance RCT (Banerjee et al.): randomly assigned microcredit to some villages; compared income/consumption 2 years later. ATE estimate = positive but modest income effect. RCT design means we can confidently attribute this to the credit program — not to pre-existing differences between borrowers and non-borrowers. Landmark example of rigorous impact evaluation.
· · ·
R3
Before Data Collection
Literature Review & Hypothesis Formulation
📚
Literature Review
Why Review the Literature?
Identifies what is already known — avoid duplicating work
Locates gaps your research fills
Provides theoretical framework and conceptual models
Guides appropriate methodology and instruments
Databases: PubMed, Web of Science, Scopus, Google Scholar, JSTOR
🎯
Hypothesis Formulation
Good Hypotheses
Stated as relationship between two or more variables
Testable with available data and methods
Grounded in theory and prior literature
Null H₀: No effect/relationship — what we statistically test
Directional (one-sided): Stronger theory → directional; exploratory → two-sided
💡
PICO Framework
Structuring Research Questions
Especially in health research: Population — Intervention/Exposure — Comparison — Outcome. Example: Among Bangladeshi children under 5 (P), does exclusive breastfeeding for 6 months (I) compared to mixed feeding (C) reduce stunting rates (O)? Clear PICO prevents vague, unanswerable questions.
· · ·
R4
Measurement
Measurement, Scales & Questionnaire Design
📏
Scales of Measurement
Nominal · Ordinal · Interval · Ratio
Nominal: Gender, religion, blood type — categories only; mode appropriate
Ordinal: Education level, satisfaction rating — ranked; median appropriate
Interval: Temperature, IQ — equal intervals, no true zero; mean appropriate
Ratio: Income, weight, height — true zero; all measures appropriate
📝
Questionnaire Design
Golden Rules
Each question measures ONE thing only (no double-barrelled questions)
Use simple, clear language appropriate for target population
Avoid leading questions ("Don't you agree that…?")
Order: easy/non-sensitive first; sensitive/demographics last
Pilot test with 10–20 people before full deployment
💡
Response Scales
Likert, Semantic Differential & VAS
Likert scale: 1–5 or 1–7 agreement scale; treat as ordinal (or approximately interval for ≥5 points)
Semantic differential: Bipolar adjectives (good–bad, fast–slow) on 7-point scale
VAS (Visual Analogue Scale): 0–100mm line; continuous; good for pain, intensity
· · ·
R5
Quality Assurance
Validity, Reliability & Data Quality
🎯
Validity
Are We Measuring What We Intend?
Content validity: Items cover the full domain
Construct validity: Measures the theoretical construct (convergent + discriminant)
Criterion validity: Correlates with gold standard (concurrent + predictive)
Internal validity: Study design allows causal inference (no confounding)
External validity: Results generalise to other populations/settings
🔁
Reliability
Consistency of Measurement
Test-retest reliability: Same result on repeated measurement (Pearson r)
Inter-rater reliability: Different raters agree (Cohen's κ)
Internal consistency: Items in scale hang together (Cronbach's α ≥ 0.7)
💡
Validity vs Reliability
The Dartboard Analogy
Reliable but not valid: all darts in tight cluster but hitting the wrong target. Valid but not reliable: darts scattered but centred on the right target. Reliable AND valid: tight cluster on the correct target. Reliability is necessary but not sufficient for validity.
Methods: Study design, population, sample, instruments, analysis plan
Results: Tables, figures, statistical findings — no interpretation
Discussion: Interpret, compare with literature, limitations, implications
Conclusion: Answer the research question; recommendations
😄 Ethics Reminder"In research ethics, the three golden rules are: (1) Do not harm participants, (2) Do not lie to participants, (3) Do not lie about participants in your results. The fourth, unofficial rule: (4) Do not add authors who made no real contribution — honorary authorship is a form of research misconduct." Always get IRB clearance before data collection, not after — retroactive approval doesn't exist!
📚 Reference Books
[1]
Probability and Statistical Inference
Hogg, R.V., Tanis, E.A., & Zimmerman, D.L. — John Wiley & Sons · 9th Ed.